Arrow Research search

Author name cluster

Matthew E. Taylor

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

83 papers
2 author rows

Possible papers

83

AAAI Conference 2025 System Paper

An LLM-Guided Tutoring System for Social Skills Training

  • Michael Guevarra
  • Indronil Bhattacharjee
  • Srijita Das
  • Christabel Wayllace
  • Carrie Demmans Epp
  • Matthew E. Taylor
  • Alan Tay

Social skills training targets behaviors necessary for success in social interactions. However, traditional classroom training for such skills is often insufficient to teach effective communication — one-to-one interaction in real-world scenarios is preferred to lecture-style information delivery. This paper introduces a framework that allows instructors to collaborate with large language models to dynamically design realistic scenarios for students to communicate. Our framework uses these scenarios to enable student rehearsal, provide immediate feedback and visualize performance for both students and instructors. Unlike traditional intelligent tutoring systems, instructors can easily co-create scenarios with a large language model without technical skills. Additionally, the system generates new scenario branches in real time when existing options don't fit the student's response.

AAMAS Conference 2025 Conference Paper

Boosting Robustness in Preference-Based Reinforcement Learning with Dynamic Sparsity

  • Calarina Muslimani
  • Bram Grooten
  • Deepak R. S. Mamillapalli
  • Mykola Pechenizkiy
  • Decebal C. Mocanu
  • Matthew E. Taylor

To integrate into human-centered environments, autonomous agents must learn from and adapt to humans in their native settings. Preference-based reinforcement learning (PbRL) can enable this by learning reward functions from human preferences. However, humans live in a world full of diverse information, most of which is irrelevant to completing any particular task. It then becomes essential that agents learn to focus on the subset of task-relevant state features. To that end, this work proposes R2N (Robust-to-Noise), the first PbRL algorithm that leverages principles of dynamic sparse training to learn robust reward models that can focus on task-relevant features. In experiments with a simulated teacher, we demonstrate that R2N can adapt the sparse connectivity of its neural networks to focus on task-relevant features, enabling R2N to significantly outperform several sparse training and PbRL algorithms across simulated robotic environments.

RLJ Journal 2025 Journal Article

Efficient Morphology-Aware Policy Transfer to New Embodiments

  • Michael Przystupa
  • Hongyao Tang
  • Glen Berseth
  • Mariano Phielipp
  • Santiago Miret
  • Martin Jägersand
  • Matthew E. Taylor

Morphology-aware policy learning is a means of enhancing policy sample efficiency by aggregating data from multiple agents. These types of policies have previously been shown to help generalize over dynamic, kinematic, and limb configuration variations between agent morphologies. Unfortunately, these policies still have sub-optimal zero-shot performance compared to end-to-end finetuning on morphologies at deployment. This limitation has ramifications in practical applications such as robotics because further data collection to perform end-to-end finetuning can be computationally expensive. In this work, we investigate combining morphology-aware pretraining with parameter-efficient finetuning (PEFT) techniques to help reduce the learnable parameters necessary to specialize a morphology-aware policy to a target embodiment. We compare directly tuning subsets of model weights, input learnable adapters, and prefix tuning techniques for online finetuning. Our analysis reveals that PEFT techniques, in conjunction with policy pre-training, generally reduce the number of samples necessary to improve a policy compared to training models end-to-end from scratch. We further find that tuning less than 1% of total parameters improves policy performance over the zero-shot performance of the base pretrained policy.
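
Illustrative sketch (not the paper's implementation): one way adapter-style parameter-efficient finetuning of a pretrained policy could look, with a small input adapter trained while the base network stays frozen. Module names and sizes are assumptions.

```python
# Hypothetical sketch of adapter-style parameter-efficient finetuning:
# the pretrained base policy is frozen and only a small residual adapter
# on the observation is trained. Module names and sizes are assumptions.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Small bottleneck adapter; the residual keeps pretrained behavior at init."""
    def __init__(self, dim: int, bottleneck: int = 16):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

class AdaptedPolicy(nn.Module):
    def __init__(self, base_policy: nn.Module, obs_dim: int):
        super().__init__()
        self.base = base_policy
        for p in self.base.parameters():
            p.requires_grad = False          # freeze the pretrained weights
        self.adapter = Adapter(obs_dim)      # the only trainable parameters

    def forward(self, obs):
        return self.base(self.adapter(obs))

base = nn.Sequential(nn.Linear(32, 64), nn.Tanh(), nn.Linear(64, 8))  # stand-in policy
policy = AdaptedPolicy(base, obs_dim=32)
optimizer = torch.optim.Adam(policy.adapter.parameters(), lr=3e-4)
actions = policy(torch.randn(4, 32))         # forward pass through the frozen base
```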

RLC Conference 2025 Conference Paper

Efficient Morphology-Aware Policy Transfer to New Embodiments

  • Michael Przystupa
  • Hongyao Tang
  • Glen Berseth
  • Mariano Phielipp
  • Santiago Miret
  • Martin Jägersand
  • Matthew E. Taylor

Morphology-aware policy learning is a means of enhancing policy sample efficiency by aggregating data from multiple agents. These types of policies have previously been shown to help generalize over dynamic, kinematic, and limb configuration variations between agent morphologies. Unfortunately, these policies still have sub-optimal zero-shot performance compared to end-to-end finetuning on morphologies at deployment. This limitation has ramifications in practical applications such as robotics because further data collection to perform end-to-end finetuning can be computationally expensive. In this work, we investigate combining morphology-aware pretraining with parameter-efficient finetuning (PEFT) techniques to help reduce the learnable parameters necessary to specialize a morphology-aware policy to a target embodiment. We compare directly tuning subsets of model weights, input learnable adapters, and prefix tuning techniques for online finetuning. Our analysis reveals that PEFT techniques, in conjunction with policy pre-training, generally reduce the number of samples necessary to improve a policy compared to training models end-to-end from scratch. We further find that tuning less than 1% of total parameters improves policy performance over the zero-shot performance of the base pretrained policy.

ICLR Conference 2025 Conference Paper

Leveraging Sub-Optimal Data for Human-in-the-Loop Reinforcement Learning

  • Calarina Muslimani
  • Matthew E. Taylor

To create useful reinforcement learning (RL) agents, step zero is to design a suitable reward function that captures the nuances of the task. However, reward engineering can be a difficult and time-consuming process. Instead, human-in-the-loop RL methods hold the promise of learning reward functions from human feedback. Despite recent successes, many of the human-in-the-loop RL methods still require numerous human interactions to learn successful reward functions. To improve the feedback efficiency of human-in-the-loop RL methods (i.e., require less human interaction), this paper introduces Sub-optimal Data Pre-training, SDP, an approach that leverages reward-free, sub-optimal data to improve scalar- and preference-based RL algorithms. In SDP, we start by pseudo-labeling all low-quality data with the minimum environment reward. Through this process, we obtain reward labels to pre-train our reward model without requiring human labeling or preferences. This pre-training phase provides the reward model a head start in learning, enabling it to recognize that low-quality transitions should be assigned low rewards. Through extensive experiments with both simulated and human teachers, we find that SDP can at least meet, but often significantly improve, state-of-the-art human-in-the-loop RL performance across a variety of simulated robotic tasks.
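
Illustrative sketch (not the authors' code): the pseudo-labeling step described above, where reward-free sub-optimal transitions are labeled with the minimum environment reward and used to pre-train a reward model. Names such as RewardModel and r_min are assumptions.

```python
# Hypothetical sketch of the pseudo-labeling step: every transition in a
# reward-free, sub-optimal dataset gets the minimum environment reward as
# its label, and the reward model is pre-trained on those labels.
# RewardModel, r_min, and the tensor shapes are assumptions.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))

def pretrain_on_suboptimal(model, obs, act, r_min, epochs=10, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    target = torch.full((obs.shape[0], 1), float(r_min))  # pseudo-label: minimum reward
    for _ in range(epochs):
        loss = nn.functional.mse_loss(model(obs, act), target)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model

model = RewardModel(obs_dim=17, act_dim=6)
obs, act = torch.randn(1024, 17), torch.randn(1024, 6)   # reward-free transitions
pretrain_on_suboptimal(model, obs, act, r_min=0.0)
```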

EWRL Workshop 2025 Workshop Paper

MaDi: Learning to Mask Distractions for Generalization in Visual Deep Reinforcement Learning

  • Bram Grooten
  • Tristan Tomilin
  • Gautham Vasan
  • Matthew E. Taylor
  • A. Rupam Mahmood
  • Meng Fang
  • Mykola Pechenizkiy
  • Decebal Constantin Mocanu

The visual world provides an abundance of information, but many input pixels received by agents often contain distracting stimuli. Autonomous agents need the ability to distinguish useful information from task-irrelevant perceptions, enabling them to generalize to unseen environments with new distractions. Existing works approach this problem using data augmentation or large auxiliary networks with additional loss functions. We introduce MaDi, a novel algorithm that learns to mask distractions by the reward signal only. In MaDi, the conventional actor-critic structure of deep reinforcement learning agents is complemented by a small third sibling, the Masker. This lightweight neural network generates a mask to determine what the actor and critic receive, such that they can focus on learning the task. We run experiments on the DeepMind Control Generalization Benchmark, the Distracting Control Suite, and a real UR5 Robotic Arm. Our algorithm improves the agent’s focus with useful masks, while its efficient Masker network only adds 0.2% more parameters to the original structure, in contrast to previous work. MaDi consistently achieves generalization results better than or competitive with state-of-the-art methods.
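
Illustrative sketch (not the authors' code): a lightweight Masker that outputs a per-pixel mask in [0, 1] and multiplies it with the observation before the actor and critic see it. Layer sizes are assumptions.

```python
# Hypothetical sketch of a Masker: a tiny convolutional network producing a
# per-pixel mask in [0, 1] that is multiplied with the observation before it
# reaches the actor and critic. Layer sizes are assumptions; in MaDi the
# Masker is trained from the reward-driven RL loss only.
import torch
import torch.nn as nn

class Masker(nn.Module):
    def __init__(self, in_channels: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(8, 1, kernel_size=3, padding=1), nn.Sigmoid())

    def forward(self, obs):
        mask = self.net(obs)      # (B, 1, H, W) soft mask, broadcast over channels
        return obs * mask         # masked observation fed to actor and critic

masker = Masker()
obs = torch.rand(4, 3, 84, 84)    # batch of image observations
masked_obs = masker(obs)
```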

ICML Conference 2025 Conference Paper

Model-Based Exploration in Monitored Markov Decision Processes

  • Alireza Kazemipour
  • Matthew E. Taylor
  • Michael H. Bowling

A tenet of reinforcement learning is that the agent always observes rewards. However, this is not true in many realistic settings, e.g., a human observer may not always be available to provide rewards, sensors may be limited or malfunctioning, or rewards may be inaccessible during deployment. Monitored Markov decision processes (Mon-MDPs) have recently been proposed to model such settings. However, existing Mon-MDP algorithms have several limitations: they do not fully exploit the problem structure, cannot leverage a known monitor, lack worst-case guarantees for "unsolvable" Mon-MDPs without specific initialization, and offer only asymptotic convergence proofs. This paper makes three contributions. First, we introduce a model-based algorithm for Mon-MDPs that addresses these shortcomings. The algorithm employs two instances of model-based interval estimation: one to ensure that observable rewards are reliably captured, and another to learn the minimax-optimal policy. Second, we empirically demonstrate the advantages. We show faster convergence than prior algorithms in more than four dozen benchmarks, and even more dramatic improvements when the monitoring process is known. Third, we present the first finite-sample bound on performance. We show convergence to a minimax-optimal policy even when some rewards are never observable.

AAMAS Conference 2025 Conference Paper

Taming Multi-Agent Reinforcement Learning with Estimator Variance Reduction

  • Taher Jafferjee
  • Juliusz Ziomek
  • Tianpei Yang
  • Zipeng Dai
  • Jianhong Wang
  • Matthew E. Taylor
  • Kun Shao
  • Jun Wang

Multi-agent reinforcement learning (MARL) enables systems of autonomous agents to solve complex tasks from jointly gathered experiences of the environment. Many MARL algorithms perform centralized training (CT), often in a simulated environment, where at each time-step the critic makes use of a single sample of the agents’ joint-action for training. Yet, as agents update their policies during training, these single samples may poorly represent the agents’ joint-policy, leading to high-variance gradient estimates that hinder learning. In this paper, we examine the effect on MARL estimators of allowing the number of joint-action samples taken at each time-step to be greater than 1 in training. Our theoretical analysis shows that even modestly increasing the number of joint-action samples shown to the critic leads to TD updates that closely approximate the true expected value under the current joint-policy. In particular, we prove this reduces variance in value estimates similar to that of decentralized training while maintaining the learning benefits of CT. We describe how such a protocol can be seamlessly realized by sharing policy parameters between the agents during training and apply the technique to induce lower variance in estimates in MARL methods within a general apparatus which we call Performance Enhancing Reinforcement Learning Apparatus (PERLA). Lastly, we demonstrate PERLA’s performance improvements and estimator variance reduction capabilities in a range of environments including Multi-agent Mujoco and StarCraft II.
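
Illustrative sketch (not the authors' code): a TD target that averages the bootstrapped value over several sampled joint actions instead of one, the variance-reduction idea described above. GaussianPolicy, the critic, and all sizes are assumptions.

```python
# Hypothetical sketch of the variance-reduction idea: the critic's TD target
# averages the bootstrapped value over several sampled joint actions from the
# agents' current policies instead of relying on a single joint-action sample.
import torch
import torch.nn as nn
from torch.distributions import Normal

class GaussianPolicy(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.mu = nn.Linear(obs_dim, act_dim)

    def forward(self, obs):
        return Normal(self.mu(obs), torch.ones(1))

def td_target(critic, policies, next_obs, reward, gamma=0.99, n_samples=8):
    """Average the bootstrapped value over n_samples joint-action samples."""
    values = []
    for _ in range(n_samples):
        joint_action = torch.cat([p(next_obs).sample() for p in policies], dim=-1)
        values.append(critic(torch.cat([next_obs, joint_action], dim=-1)))
    return reward + gamma * torch.stack(values).mean(dim=0)

critic = nn.Linear(10 + 2 * 3, 1)                     # obs_dim + n_agents * act_dim
policies = [GaussianPolicy(10, 3) for _ in range(2)]  # parameter sharing is one option
target = td_target(critic, policies, torch.randn(4, 10), torch.zeros(4, 1))
```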

IJCAI Conference 2025 Conference Paper

The Evolving Landscape of LLM- and VLM-Integrated Reinforcement Learning

  • Sheila Schoepp
  • Masoud Jafaripour
  • Yingyue Cao
  • Tianpei Yang
  • Fatemeh Abdollahi
  • Shadan Golestan
  • Zahin Sufiyan
  • Osmar R. Zaiane

Reinforcement learning (RL) has shown impressive results in sequential decision-making tasks. Large Language Models (LLMs) and Vision-Language Models (VLMs) have recently emerged, exhibiting impressive capabilities in multimodal understanding and reasoning. These advances have led to a surge of research integrating LLMs and VLMs into RL. This survey reviews representative works in which LLMs and VLMs are used to overcome key challenges in RL, such as lack of prior knowledge, long-horizon planning, and reward design. We present a taxonomy that categorizes these LLM/VLM-assisted RL approaches into three roles: agent, planner, and reward. We conclude by exploring open problems, including grounding, bias mitigation, improved representations, and action advice. By consolidating existing research and identifying future directions, this survey establishes a framework for integrating LLMs and VLMs into RL, advancing approaches that unify natural language and visual understanding with sequential decision-making.

RLC Conference 2025 Conference Paper

Towards Improving Reward Design in RL: A Reward Alignment Metric for RL Practitioners

  • Calarina Muslimani
  • Kerrick Johnstonbaugh
  • Suyog Chandramouli
  • Serena Booth
  • W. Bradley Knox
  • Matthew E. Taylor

Reinforcement learning agents are fundamentally limited by the quality of the reward functions they learn from, yet reward design is often overlooked under the assumption that a well-defined reward is readily available. However, in practice, designing rewards is difficult, and even when specified, evaluating their correctness is equally problematic: how do we know if a reward function is correctly specified? In our work, we address these challenges by focusing on reward alignment: assessing whether a reward function accurately encodes the preferences of a human stakeholder. As a concrete measure of reward alignment, we introduce the Trajectory Alignment Coefficient to quantify the similarity between a human stakeholder's ranking of trajectory distributions and those induced by a given reward function. We show that the Trajectory Alignment Coefficient exhibits desirable properties, such as not requiring access to a ground-truth reward, invariance to potential-based reward shaping, and applicability to online RL. Additionally, in an 11-person user study of RL practitioners, we found that access to the Trajectory Alignment Coefficient during reward selection led to statistically significant improvements. Compared to relying only on reward functions, our metric reduced cognitive workload by 1.5x, was preferred by 82% of users, and increased the success rate of selecting reward functions that produced performant policies by 41%.
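
Illustrative sketch (not the paper's exact definition): a Kendall-tau-style pairwise agreement score between a human's trajectory ranking and the ranking induced by a candidate reward function's returns. Function names are assumptions.

```python
# Hypothetical, Kendall-tau-style sketch of a pairwise agreement score: for
# every pair of trajectories, check whether the human's preference and the
# candidate reward function's returns order them the same way. This is an
# illustration of the idea, not the paper's exact formula.
from itertools import combinations

def trajectory_return(trajectory, reward_fn):
    return sum(reward_fn(s, a) for s, a in trajectory)

def alignment_coefficient(trajectories, human_scores, reward_fn):
    """(concordant - discordant) / (concordant + discordant), in [-1, 1]."""
    returns = [trajectory_return(t, reward_fn) for t in trajectories]
    concordant = discordant = 0
    for i, j in combinations(range(len(trajectories)), 2):
        human_diff = human_scores[i] - human_scores[j]
        reward_diff = returns[i] - returns[j]
        if human_diff * reward_diff > 0:
            concordant += 1
        elif human_diff * reward_diff < 0:
            discordant += 1
    total = concordant + discordant
    return 0.0 if total == 0 else (concordant - discordant) / total

# Toy usage: trajectories are (state, action) pairs, reward_fn is a candidate reward.
trajs = [[(0, 1), (1, 1)], [(0, 0), (1, 0)], [(2, 1), (2, 1)]]
human = [2.0, 0.0, 1.0]                                  # higher = preferred by the human
coef = alignment_coefficient(trajs, human, lambda s, a: s + a)
```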

RLJ Journal 2025 Journal Article

Towards Improving Reward Design in RL: A Reward Alignment Metric for RL Practitioners

  • Calarina Muslimani
  • Kerrick Johnstonbaugh
  • Suyog Chandramouli
  • Serena Booth
  • W. Bradley Knox
  • Matthew E. Taylor

Reinforcement learning agents are fundamentally limited by the quality of the reward functions they learn from, yet reward design is often overlooked under the assumption that a well-defined reward is readily available. However, in practice, designing rewards is difficult, and even when specified, evaluating their correctness is equally problematic: how do we know if a reward function is correctly specified? In our work, we address these challenges by focusing on reward alignment: assessing whether a reward function accurately encodes the preferences of a human stakeholder. As a concrete measure of reward alignment, we introduce the Trajectory Alignment Coefficient to quantify the similarity between a human stakeholder's ranking of trajectory distributions and those induced by a given reward function. We show that the Trajectory Alignment Coefficient exhibits desirable properties, such as not requiring access to a ground-truth reward, invariance to potential-based reward shaping, and applicability to online RL. Additionally, in an 11-person user study of RL practitioners, we found that access to the Trajectory Alignment Coefficient during reward selection led to statistically significant improvements. Compared to relying only on reward functions, our metric reduced cognitive workload by 1.5x, was preferred by 82% of users, and increased the success rate of selecting reward functions that produced performant policies by 41%.

AAAI Conference 2024 Conference Paper

A Transfer Approach Using Graph Neural Networks in Deep Reinforcement Learning

  • Tianpei Yang
  • Heng You
  • Jianye Hao
  • Yan Zheng
  • Matthew E. Taylor

Transfer learning (TL) has shown great potential to improve Reinforcement Learning (RL) efficiency by leveraging prior knowledge in new tasks. However, much of the existing TL research focuses on transferring knowledge between tasks that share the same state-action spaces. Further, transfer from multiple source tasks that have different state-action spaces is more challenging and needs to be solved urgently to improve the generalization and practicality of the method in real-world scenarios. This paper proposes TURRET (Transfer Using gRaph neuRal nETworks), to utilize the generalization capabilities of Graph Neural Networks (GNNs) to facilitate efficient and effective multi-source policy transfer learning in the state-action mismatch setting. TURRET learns a semantic representation by accounting for the intrinsic property of the agent through GNNs, which leads to a unified state embedding space for all tasks. As a result, TURRET achieves more efficient transfer with strong generalization ability between different tasks and can be easily combined with existing Deep RL algorithms. Experimental results show that TURRET significantly outperforms other TL methods on multiple continuous action control tasks, successfully transferring across robots with different state-action spaces.

TMLR Journal 2024 Journal Article

Conservative Evaluation of Offline Policy Learning

  • Hager Radi Abdelwahed
  • Josiah P. Hanna
  • Matthew E. Taylor

The world offers unprecedented amounts of data in real-world domains, from which we can develop successful decision-making systems. It is possible for reinforcement learning (RL) to learn control policies offline from such data but challenging to deploy an agent during learning in safety-critical domains. Offline RL learns from historical data without access to an environment. Therefore, we need a methodology for estimating how a newly-learned agent will perform when deployed in the real environment before actually deploying it. To achieve this, we propose a framework for conservative evaluation of offline policy learning (CEOPL). We focus on being conservative so that the probability that our agent performs below a baseline is approximately δ, where δ specifies how much risk we are willing to accept. In our setting, we assume access to a data stream, split into a train-set to learn an offline policy, and a test-set to estimate a lower bound on the offline policy using off-policy evaluation with bootstrap confidence intervals. A lower-bound estimate allows us to decide when to deploy our learned policy with minimal risk of overestimation. We demonstrate CEOPL on a range of tasks as well as real-world medical data.

JAIR Journal 2024 Journal Article

Human-in-the-Loop Reinforcement Learning: A Survey and Position on Requirements, Challenges, and Opportunities

  • Carl Orge Retzlaff
  • Srijita Das
  • Christabel Wayllace
  • Payam Mousavi
  • Mohammad Afshari
  • Tianpei Yang
  • Anna Saranti
  • Alessa Angerschmid

Artificial intelligence (AI) and especially reinforcement learning (RL) have the potential to enable agents to learn and perform tasks autonomously with superhuman performance. However, we consider RL as fundamentally a Human-in-the-Loop (HITL) paradigm, even when an agent eventually performs its task autonomously. In cases where the reward function is challenging or impossible to define, HITL approaches are considered particularly advantageous. The application of Reinforcement Learning from Human Feedback (RLHF) in systems such as ChatGPT demonstrates the effectiveness of optimizing for user experience and integrating their feedback into the training loop. In HITL RL, human input is integrated during the agent’s learning process, allowing iterative updates and fine-tuning based on human feedback, thus enhancing the agent’s performance. Since the human is an essential part of this process, we argue that human-centric approaches are the key to successful RL, a fact that has not been adequately considered in the existing literature. This paper aims to inform readers about current explainability methods in HITL RL. It also shows how the application of explainable AI (xAI) and specific improvements to existing explainability approaches can enable a better human-agent interaction in HITL RL for all types of users, whether for lay people, domain experts, or machine learning specialists. Accounting for the workflow in HITL RL and based on software and machine learning methodologies, this article identifies four phases for human involvement for creating HITL RL systems: (1) Agent Development, (2) Agent Learning, (3) Agent Evaluation, and (4) Agent Deployment. We highlight human involvement, explanation requirements, new challenges, and goals for each phase. We furthermore identify low-risk, high-return opportunities for explainability research in HITL RL and present long-term research goals to advance the field. Finally, we propose a vision of human-robot collaboration that allows both parties to reach their full potential and cooperate effectively.

AAMAS Conference 2024 Conference Paper

Leveraging Sub-Optimal Data for Human-in-the-Loop Reinforcement Learning

  • Calarina Muslimani
  • Matthew E. Taylor

To create useful reinforcement learning (RL) agents, step zero is to design a suitable reward function that captures the nuances of the task. However, reward engineering can be a difficult and time-consuming process. Instead, human-in-the-loop (HitL) RL approaches allow agents to learn reward functions from human feedback. Despite recent successes, many of the HitL RL methods still require numerous human interactions to learn successful reward functions. To that end, this work introduces Sub-optimal Data Pretraining, SDP, a method that leverages reward-free, sub-optimal data to improve the feedback efficiency of HitL RL algorithms. We demonstrate that SDP can significantly improve over state-of-the-art HitL RL algorithms in three DMControl environments.

IROS Conference 2024 Conference Paper

Local Linearity is All You Need (in Data-Driven Teleoperation)

  • Michael Przystupa
  • Gauthier Gidel
  • Matthew E. Taylor
  • Martin Jägersand
  • Justus H. Piater
  • Samuele Tosatto

One of the critical aspects of assistive robotics is to provide a control system of a high-dimensional robot from a low-dimensional user input (i.e., a 2D joystick). Data-driven teleoperation seeks to provide an intuitive user interface called an action map to map the low-dimensional input to robot velocities from human demonstrations. Action maps are machine learning models trained on robotic demonstration data to map user input directly to desired movements as opposed to aspects of robot pose ("move to cup or pour content" vs. "move along x- or y-axis"). Many works have investigated nonlinear action maps with multi-layer perceptrons, but recent work suggests that local-linear neural approximations provide better control of the system. However, local linear models assume actions exist on a linear subspace and may not capture nuanced motions in training data. In this work, we hypothesize that local-linear neural networks are effective because they make the action map odd w.r.t. the user input, enhancing the intuitiveness of the controller. Based on this assumption, we propose two nonlinear means of encoding odd behavior that do not constrain the action map to a local linear function. However, our analysis reveals that these models effectively behave like local linear models for relevant mappings between user joysticks and robot movements. We support this claim in simulation, and show on a real-world use case that there is no statistical benefit of using non-linear maps, according to the users' experience. These negative results suggest that further investigation into model architectures beyond local linear models may offer diminishing returns for improving user experience in data-driven teleoperation systems.

AAMAS Conference 2024 Conference Paper

MaDi: Learning to Mask Distractions for Generalization in Visual Deep Reinforcement Learning

  • Bram Grooten
  • Tristan Tomilin
  • Gautham Vasan
  • Matthew E. Taylor
  • A. Rupam Mahmood
  • Meng Fang
  • Mykola Pechenizkiy
  • Decebal Constantin Mocanu

The visual world provides an abundance of information, but many input pixels received by agents often contain distracting stimuli. Autonomous agents need the ability to distinguish useful information from task-irrelevant perceptions, enabling them to generalize to unseen environments with new distractions. Existing works approach this problem using data augmentation or large auxiliary networks with additional loss functions. We introduce MaDi, a novel algorithm that learns to mask distractions by the reward signal only. In MaDi, the conventional actor-critic structure of deep reinforcement learning agents is complemented by a small third sibling, the Masker. This lightweight neural network generates a mask to determine what the actor and critic receive, such that they can focus on learning the task. We run experiments on the DeepMind Control Generalization Benchmark, the Distracting Control Suite, and a real UR5 Robotic Arm. Our algorithm improves the agent’s focus with useful masks, while its efficient Masker network only adds 0.2% more parameters to the original structure, in contrast to previous work. MaDi consistently achieves generalization results better than or competitive with state-of-the-art methods.

AAMAS Conference 2024 Conference Paper

Monitored Markov Decision Processes

  • Simone Parisi
  • Montaser Mohammedalamen
  • Alireza Kazemipour
  • Matthew E. Taylor
  • Michael Bowling

In reinforcement learning (RL), an agent learns to perform a task by interacting with an environment and receiving feedback (a numerical reward) for its actions. However, the assumption that rewards are always observable is often not applicable in real-world problems. For example, the agent may need to ask a human to supervise its actions or activate a monitoring system to receive feedback. There may even be a period of time before rewards become observable, or a period of time after which rewards are no longer given. In other words, there are cases where the environment generates rewards in response to the agent’s actions but the agent cannot observe them. In this paper, we formalize a novel but general RL framework — Monitored MDPs — where the agent cannot always observe rewards. We discuss the theoretical and practical consequences of this setting, show challenges raised even in toy environments, and propose algorithms to begin to tackle this novel setting. This paper introduces a powerful new formalism that encompasses both new and existing problems and lays the foundation for future research.

AAMAS Conference 2024 Conference Paper

PADDLE: Logic Program Guided Policy Reuse in Deep Reinforcement Learning

  • Hao Zhang
  • Tianpei Yang
  • Yan Zheng
  • Jianye Hao
  • Matthew E. Taylor

Learning new skills through previous experience is common in human life, and this is the core idea of Transfer Reinforcement Learning (TRL). TRL requires the agent to learn when and which source policy is the best to reuse as the target task’s policy and how to reuse the source policy. Most TRL methods learn, transfer, and reuse black-box policies, which makes it hard to explain 1) when to reuse and 2) which source policy is effective, and which reduces transfer efficiency. In this paper, we propose a novel TRL method called ProgrAm guiDeD poLicy rEuse (PADDLE). PADDLE can measure the logic similarities between tasks and transfer knowledge that reflects the logic behind the target task. To achieve this, we first propose a hybrid decision model that synthesizes high-level logic programs and learns a low-level DRL policy to learn source tasks. Second, we propose a transferability metric that can measure the logic similarity between the target task and source tasks. Last, we combine it with the low-level policy similarity to select the appropriate source policy as the guiding policy for the target task. Experimental results show that PADDLE can effectively select the appropriate source tasks to guide learning on the target task, outperforming black-box TRL methods.

AAAI Conference 2024 Conference Paper

PORTAL: Automatic Curricula Generation for Multiagent Reinforcement Learning

  • Jizhou Wu
  • Jianye Hao
  • Tianpei Yang
  • Xiaotian Hao
  • Yan Zheng
  • Weixun Wang
  • Matthew E. Taylor

Despite many breakthroughs in recent years, it is still hard for MultiAgent Reinforcement Learning (MARL) algorithms to directly solve complex tasks in MultiAgent Systems (MASs) from scratch. In this work, we study how to use Automatic Curriculum Learning (ACL) to reduce the number of environmental interactions required to learn a good policy. In order to solve a difficult task, ACL methods automatically select a sequence of tasks (i.e., curricula). The idea is to obtain maximum learning progress towards the final task by continuously learning on tasks that match the current capabilities of the learners. The key question is how to measure the learning progress of the learner for better curriculum selection. We propose a novel ACL framework, PrOgRessive mulTiagent Automatic curricuLum (PORTAL), for MASs. PORTAL selects curricula according to two criteria: 1) How difficult is a task, relative to the learners’ current abilities? 2) How similar is a task, relative to the final task? By learning a shared feature space between tasks, PORTAL is able to characterize different tasks based on the distribution of features and select those that are similar to the final task. Also, the shared feature space can effectively facilitate the policy transfer between curricula. Experimental results show that PORTAL can train agents to master extremely hard cooperative tasks, which cannot be achieved with previous state-of-the-art MARL algorithms.

AAMAS Conference 2023 Conference Paper

Automatic Noise Filtering with Dynamic Sparse Training in Deep Reinforcement Learning

  • Bram Grooten
  • Ghada Sokar
  • Shibhansh Dohare
  • Elena Mocanu
  • Matthew E. Taylor
  • Mykola Pechenizkiy
  • Decebal Constantin Mocanu

Tomorrow’s robots will need to distinguish useful information from noise when performing different tasks. A household robot for instance may continuously receive a plethora of information about the home, but needs to focus on just a small subset to successfully execute its current chore. Filtering distracting inputs that contain irrelevant data has received little attention in the reinforcement learning literature. To start resolving this, we formulate a problem setting in reinforcement learning called the extremely noisy environment (ENE), where up to 99% of the input features are pure noise. Agents need to detect which features provide task-relevant information about the state of the environment. Consequently, we propose a new method termed Automatic Noise Filtering (ANF), which uses the principles of dynamic sparse training in synergy with various deep reinforcement learning algorithms. The sparse input layer learns to focus its connectivity on task-relevant features, such that ANF-SAC and ANF-TD3 outperform standard SAC and TD3 by a large margin, while using up to 95% fewer weights. Furthermore, we devise a transfer learning setting for ENEs, by permuting all features of the environment after 1M timesteps to simulate the fact that other information sources can become relevant as the world evolves. Again, ANF surpasses the baselines in final performance and sample complexity. Our code is available online.
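
Illustrative sketch (not the authors' code): a dynamically sparse input layer whose weakest active connections are periodically pruned and regrown at random, so connectivity can drift toward task-relevant features. Sizes and the prune fraction are assumptions.

```python
# Hypothetical sketch of a dynamically sparse input layer: a binary mask keeps
# only a small fraction of input connections, and the weakest active weights
# are periodically pruned and regrown at random positions, letting the layer
# drift toward task-relevant features. Sizes and the prune fraction are assumptions.
import torch
import torch.nn as nn

class SparseInputLayer(nn.Module):
    def __init__(self, in_features: int, out_features: int, density: float = 0.05):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        mask = (torch.rand(out_features, in_features) < density).float()
        self.register_buffer("mask", mask)

    def forward(self, x):
        return nn.functional.linear(x, self.linear.weight * self.mask, self.linear.bias)

    @torch.no_grad()
    def prune_and_regrow(self, fraction: float = 0.2):
        magnitudes = (self.linear.weight * self.mask).abs()
        active = self.mask.nonzero(as_tuple=False)
        n_change = max(1, int(fraction * len(active)))
        # Prune the weakest currently active connections...
        weakest = magnitudes[active[:, 0], active[:, 1]].argsort()[:n_change]
        self.mask[active[weakest, 0], active[weakest, 1]] = 0.0
        # ...and regrow the same number at random inactive positions.
        inactive = (self.mask == 0).nonzero(as_tuple=False)
        regrow = inactive[torch.randperm(len(inactive))[:n_change]]
        self.mask[regrow[:, 0], regrow[:, 1]] = 1.0

layer = SparseInputLayer(in_features=1000, out_features=256)   # mostly noisy inputs
out = layer(torch.randn(8, 1000))
layer.prune_and_regrow()      # called periodically during training in practice
```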

IJCAI Conference 2023 Conference Paper

Can You Improve My Code? Optimizing Programs with Local Search

  • Fatemeh Abdollahi
  • Saqib Ameen
  • Matthew E. Taylor
  • Levi H. S. Lelis

This paper introduces a local search method for improving an existing program with respect to a measurable objective. Program Optimization with Locally Improving Search (POLIS) exploits the structure of a program, defined by its lines. POLIS improves a single line of the program while keeping the remaining lines fixed, using existing brute-force synthesis algorithms, and continues iterating until it is unable to improve the program's performance. POLIS was evaluated with a 27-person user study, where participants wrote programs attempting to maximize the score of two single-agent games: Lunar Lander and Highway. POLIS was able to substantially improve the participants' programs with respect to the game scores. A proof-of-concept demonstration on existing Stack Overflow code measures applicability in real-world problems. These results suggest that POLIS could be used as a helpful programming assistant for programming problems with measurable objectives.
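
Illustrative sketch (not the authors' code): the line-level local search loop described above, with a stand-in candidate generator in place of the brute-force synthesizer. All names are assumptions.

```python
# Hypothetical sketch of line-level local search: improve one line at a time,
# keep a change only if the measured objective goes up, and stop when no
# single-line change helps. The candidate generator stands in for the
# brute-force synthesizer the paper uses.
def local_search(program_lines, evaluate, propose_line_candidates, max_passes=10):
    """program_lines: list of line representations; evaluate: program -> score."""
    best_score = evaluate(program_lines)
    for _ in range(max_passes):
        improved = False
        for i in range(len(program_lines)):
            for candidate in propose_line_candidates(program_lines, i):
                trial = program_lines[:i] + [candidate] + program_lines[i + 1:]
                score = evaluate(trial)
                if score > best_score:
                    program_lines, best_score, improved = trial, score, True
        if not improved:          # local optimum: no single-line change improves the score
            break
    return program_lines, best_score

# Toy usage: "lines" are numbers and the objective prefers their sum to be 10.
lines = [1, 5, 2]
program, score = local_search(
    lines,
    evaluate=lambda p: -abs(sum(p) - 10),
    propose_line_candidates=lambda p, i: [p[i] + 1, p[i] - 1])
```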

AAMAS Conference 2023 Conference Paper

Hiking up that HILL with Cogment-Verse: Train & Operate Multi-agent Systems Learning from Humans

  • Sai Krishna Gottipati
  • Luong-Ha Nguyen
  • Clodéric Mars
  • Matthew E. Taylor

As more AI systems are deployed, humans are increasingly required to interact with them in multiple settings. However, such AI systems seldom learn from these interactions with humans, which provides an important opportunity to improve from human expertise and context awareness. Several recent results in the fields of reinforcement learning (RL) and human-in-the-loop learning (HILL) show that AI agents can perform better when humans are involved in their training process. Humans can provide rewards to the agent, demonstrate tasks, design curricula, or act directly in the environment, but these potential performance improvements also come with architectural, functional design, and engineering complexities. This paper discusses Cogment, a unifying open-source framework that introduces a formalism to support a variety of human(s)-agent(s) collaboration topologies and training approaches. Cogment addresses the complexity of training with humans within a production-ready platform. On top of Cogment, we introduce Cogment Verse, a research platform dedicated to the research community to facilitate the implementation of HILL and Multi-Agent RL experiments. With these platforms, our end goal is to enable the generalization of intelligence ecosystems where AI agents and humans learn from each other and collaborate to address increasingly complex or sensitive use cases. The video demonstration is available at https://youtu.be/v-K0DqIL9K0

AAMAS Conference 2023 Conference Paper

Learning from Multiple Independent Advisors in Multi-agent Reinforcement Learning

  • Sriram Ganapathi Subramanian
  • Matthew E. Taylor
  • Kate Larson
  • Mark Crowley

Multi-agent reinforcement learning typically suffers from the problem of sample inefficiency, where learning suitable policies involves the use of many data samples. Learning from external demonstrators is a possible solution that mitigates this problem. However, most prior approaches in this area assume the presence of a single demonstrator. Leveraging multiple knowledge sources (i.e., advisors) with expertise in distinct aspects of the environment could substantially speed up learning in complex environments. This paper considers the problem of simultaneously learning from multiple independent advisors in multi-agent reinforcement learning. The approach leverages a two-level Q-learning architecture, and extends this framework from single-agent to multi-agent settings. We provide principled algorithms that incorporate a set of advisors by both evaluating the advisors at each state and subsequently using the advisors to guide action selection. We also provide theoretical convergence and sample complexity guarantees. Experimentally, we validate our approach in three different test-beds and show that our algorithms give better performances than baselines, can effectively integrate the combined expertise of different advisors, and learn to ignore bad advice.

TMLR Journal 2023 Journal Article

Learning Representations for Pixel-based Control: What Matters and Why?

  • Manan Tomar
  • Utkarsh Aashu Mishra
  • Amy Zhang
  • Matthew E. Taylor

Learning representations for pixel-based control has garnered significant attention recently in reinforcement learning. A wide range of methods have been proposed to enable efficient learning, leading to sample complexities similar to those in the full state setting. However, moving beyond carefully curated pixel data sets (centered crop, appropriate lighting, clear background, etc.) remains challenging. In this paper, we adopt a more difficult setting, incorporating background distractors, as a first step towards addressing this challenge. We start by exploring a simple baseline approach that does not use metric-based learning, data augmentations, world-model learning, or contrastive learning. We then analyze when and why previously proposed methods are likely to fail or reduce to the same performance as the baseline in this harder setting and why we should think carefully about extending such methods beyond the well curated environments. Our results show that finer categorization of benchmarks on the basis of characteristics like density of reward, planning horizon of the problem, presence of task-irrelevant components, etc., is crucial in evaluating algorithms. Based on these observations, we propose different metrics to consider when evaluating an algorithm on benchmark tasks. We hope such a data-centric view can motivate researchers to rethink representation learning when investigating how to best apply RL to real-world tasks. Code available: https://github.com/UtkarshMishra04/pixel-representations-RL

IJCAI Conference 2023 Conference Paper

Multi-Agent Advisor Q-Learning (Extended Abstract)

  • Sriram Ganapathi Subramanian
  • Matthew E. Taylor
  • Kate Larson
  • Mark Crowley

In the last decade, there have been significant advances in multi-agent reinforcement learning (MARL) but there are still numerous challenges, such as high sample complexity and slow convergence to stable policies, that need to be overcome before wide-spread deployment is possible. However, many real-world environments already, in practice, deploy sub-optimal or heuristic approaches for generating policies. An interesting question that arises is how to best use such approaches as advisors to help improve reinforcement learning in multi-agent domains. We provide a principled framework for incorporating action recommendations from online sub-optimal advisors in multi-agent settings. We describe the problem of ADvising Multiple Intelligent Reinforcement Agents (ADMIRAL) in nonrestrictive general-sum stochastic game environments and present two novel Q-learning-based algorithms: ADMIRAL - Decision Making (ADMIRAL-DM) and ADMIRAL - Advisor Evaluation (ADMIRAL-AE), which allow us to improve learning by appropriately incorporating advice from an advisor (ADMIRAL-DM), and evaluate the effectiveness of an advisor (ADMIRAL-AE). We analyze the algorithms theoretically and provide fixed point guarantees regarding their learning in general-sum stochastic games. Furthermore, extensive experiments illustrate that these algorithms: can be used in a variety of environments, have performances that compare favourably to other related baselines, can scale to large state-action spaces, and are robust to poor advice from advisors.

AAMAS Conference 2023 Conference Paper

PORTAL: Automatic Curricula Generation for Multiagent Reinforcement Learning

  • Jizhou Wu
  • Tianpei Yang
  • Xiaotian Hao
  • Jianye Hao
  • Yan Zheng
  • Weixun Wang
  • Matthew E. Taylor

Despite many breakthroughs in recent years, it is still hard for MultiAgent Reinforcement Learning (MARL) algorithms to directly solve complex tasks in MultiAgent Systems (MASs) from scratch. In this work, we study how to use Automatic Curriculum Learning (ACL) to reduce the number of environmental interactions required to learn a good policy. In order to solve a difficult task, ACL methods automatically select a sequence of tasks (i.e., curricula). The idea is to obtain maximum learning progress towards the final task by continuously learning on tasks that match the current capabilities of the learners. The key question is how to measure the learning progress of the learner for better curriculum selection. We propose a novel ACL framework, PrOgRessive mulTiagent Automatic curricuLum (PORTAL), for MASs. PORTAL selects curricula according to two criteria: 1) How difficult is a task, relative to the learners’ current abilities? 2) How similar is a task, relative to the final task? By learning a shared feature space between tasks, PORTAL is able to characterize different tasks based on the distribution of features and select those that are similar to the final task. Also, the shared feature space can effectively facilitate the policy transfer between curricula. Experimental results show that PORTAL can train agents to master extremely hard cooperative tasks, which cannot be achieved with previous state-of-the-art MARL algorithms.

TMLR Journal 2023 Journal Article

Reinforcement Teaching

  • Calarina Muslimani
  • Alex Lewandowski
  • Dale Schuurmans
  • Matthew E. Taylor
  • Jun Luo

Machine learning algorithms learn to solve a task, but are unable to improve their ability to learn. Meta-learning methods learn about machine learning algorithms and improve them so that they learn more quickly. However, existing meta-learning methods are either hand-crafted to improve one specific component of an algorithm or only work with differentiable algorithms. We develop a unifying meta-learning framework, called \textit{Reinforcement Teaching}, to improve the learning process of \emph{any} algorithm. Under Reinforcement Teaching, a teaching policy is learned, through reinforcement, to improve a student's learning algorithm. To learn an effective teaching policy, we introduce the \textit{parametric-behavior embedder} that learns a representation of the student's learnable parameters from its input/output behavior. We further use \textit{learning progress} to shape the teacher's reward, allowing it to more quickly maximize the student's performance. To demonstrate the generality of Reinforcement Teaching, we conduct experiments in which a teacher learns to significantly improve both reinforcement and supervised learning algorithms. Reinforcement Teaching outperforms previous work using heuristic reward functions and state representations, as well as other parameter representations.

TMLR Journal 2023 Journal Article

Two-Level Actor-Critic Using Multiple Teachers

  • Su Zhang
  • Srijita Das
  • Sriram Ganapathi Subramanian
  • Matthew E. Taylor

Deep reinforcement learning has successfully allowed agents to learn complex behaviors for many tasks. However, a key limitation of current learning approaches is the sample-inefficiency problem, which limits performance of the learning agent. This paper considers how agents can benefit from improved learning via teachers' advice. In particular, we consider the setting with multiple sub-optimal teachers, as opposed to having a single near-optimal teacher. We propose a flexible two-level actor-critic algorithm where the high-level network learns to choose the best teacher in the current situation while the low-level network learns the control policy.
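
Illustrative sketch (not the authors' code): a high-level selector network that picks, per state, whether to follow the agent's own low-level policy or one of several sub-optimal teachers. Names and sizes are assumptions; training of the selector and the low-level policy is omitted.

```python
# Hypothetical sketch of the two-level idea: a high-level selector chooses,
# per state, whether to act with the agent's own policy or one of the teachers.
import torch
import torch.nn as nn

class TeacherSelector(nn.Module):
    """High-level network over {own policy} + teachers."""
    def __init__(self, obs_dim: int, n_teachers: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_teachers + 1))

    def forward(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))

def select_action(selector, own_policy, teachers, obs):
    choice = selector(obs).sample().item()       # 0 = own policy, k = teacher k-1
    source = own_policy if choice == 0 else teachers[choice - 1]
    return source(obs), choice

own_policy = lambda o: torch.zeros(2)            # stand-in low-level policy
teachers = [lambda o: torch.ones(2), lambda o: -torch.ones(2)]
selector = TeacherSelector(obs_dim=6, n_teachers=len(teachers))
action, choice = select_action(selector, own_policy, teachers, torch.randn(1, 6))
```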

UAI Conference 2022 Conference Paper

Cross-domain adaptive transfer reinforcement learning based on state-action correspondence

  • Heng You
  • Tianpei Yang
  • Yan Zheng 0002
  • Jianye Hao
  • Matthew E. Taylor

Despite the impressive success achieved in various domains, deep reinforcement learning (DRL) is still faced with the sample inefficiency problem. Transfer learning (TL), which leverages prior knowledge from different but related tasks to accelerate the target task learning, has emerged as a promising direction to improve RL efficiency. The majority of prior work considers TL across tasks with the same state-action spaces, while transferring across domains with different state-action spaces is relatively unexplored. Furthermore, such existing cross-domain transfer approaches only enable transfer from a single source policy, leaving open the important question of how to best transfer from multiple source policies. This paper proposes a novel framework called Cross-domain Adaptive Transfer (CAT) to accelerate DRL. CAT learns the state-action correspondence from each source task to the target task and adaptively transfers knowledge from multiple source task policies to the target policy. CAT can be easily combined with existing DRL algorithms and experimental results show that CAT significantly accelerates learning and outperforms other cross-domain transfer methods on multiple continuous action control tasks.

AAAI Conference 2022 Conference Paper

Decentralized Mean Field Games

  • Sriram Ganapathi Subramanian
  • Matthew E. Taylor
  • Mark Crowley
  • Pascal Poupart

Multiagent reinforcement learning algorithms have not been widely adopted in large scale environments with many agents as they often scale poorly with the number of agents. Using mean field theory to aggregate agents has been proposed as a solution to this problem. However, almost all previous methods in this area make a strong assumption of a centralized system where all the agents in the environment learn the same policy and are effectively indistinguishable from each other. In this paper, we relax this assumption about indistinguishable agents and propose a new mean field system known as Decentralized Mean Field Games, where each agent can be quite different from others. All agents learn independent policies in a decentralized fashion, based on their local observations. We define a theoretical solution concept for this system and provide a fixed point guarantee for a Q-learning based algorithm in this system. A practical consequence of our approach is that we can address a ‘chicken-and-egg’ problem in empirical mean field reinforcement learning algorithms. Further, we provide Q-learning and actor-critic algorithms that use the decentralized mean field learning approach and give stronger performances compared to common baselines in this area. In our setting, agents do not need to be clones of each other and learn in a fully decentralized fashion. Hence, for the first time, we show the application of mean field learning methods in fully competitive environments, large-scale continuous action space environments, and other environments with heterogeneous agents. Importantly, we also apply the mean field method in a ride-sharing problem using a real-world dataset. We propose a decentralized solution to this problem, which is more practical than existing centralized training methods.

JAIR Journal 2022 Journal Article

Multi-Agent Advisor Q-Learning

  • Sriram Ganapathi Subramanian
  • Matthew E. Taylor
  • Kate Larson
  • Mark Crowley

In the last decade, there have been significant advances in multi-agent reinforcement learning (MARL) but there are still numerous challenges, such as high sample complexity and slow convergence to stable policies, that need to be overcome before wide-spread deployment is possible. However, many real-world environments already, in practice, deploy sub-optimal or heuristic approaches for generating policies. An interesting question that arises is how to best use such approaches as advisors to help improve reinforcement learning in multi-agent domains. In this paper, we provide a principled framework for incorporating action recommendations from online suboptimal advisors in multi-agent settings. We describe the problem of ADvising Multiple Intelligent Reinforcement Agents (ADMIRAL) in nonrestrictive general-sum stochastic game environments and present two novel Q-learning based algorithms: ADMIRAL - Decision Making (ADMIRAL-DM) and ADMIRAL - Advisor Evaluation (ADMIRAL-AE), which allow us to improve learning by appropriately incorporating advice from an advisor (ADMIRAL-DM), and evaluate the effectiveness of an advisor (ADMIRAL-AE). We analyze the algorithms theoretically and provide fixed point guarantees regarding their learning in general-sum stochastic games. Furthermore, extensive experiments illustrate that these algorithms: can be used in a variety of environments, have performances that compare favourably to other related baselines, can scale to large state-action spaces, and are robust to poor advice from advisors.
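
Generic sketch (not the exact ADMIRAL update rules): tabular Q-learning where the agent follows an advisor's recommendation with a probability that decays over training, so poor advice cannot dominate indefinitely. The advisor, action set, and decay schedule are assumptions.

```python
# Generic sketch of advisor-guided Q-learning, not the exact ADMIRAL algorithms:
# the agent follows the advisor's recommendation with decaying probability and
# otherwise acts greedily with respect to its own Q-values.
import random
from collections import defaultdict

def select_action(Q, state, actions, advisor, follow_prob):
    if random.random() < follow_prob:
        return advisor(state)                           # take the advisor's recommendation
    return max(actions, key=lambda a: Q[(state, a)])    # otherwise act greedily

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.95):
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

Q = defaultdict(float)
actions = [0, 1, 2]
advisor = lambda state: 0            # stand-in heuristic advisor
follow_prob = 1.0
for episode in range(100):
    follow_prob *= 0.99              # decaying reliance on the advisor
    # ... roll out the environment here, calling select_action and q_update per step
```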

ICML Conference 2022 Conference Paper

PMIC: Improving Multi-Agent Reinforcement Learning with Progressive Mutual Information Collaboration

  • Pengyi Li 0001
  • Hongyao Tang
  • Tianpei Yang
  • Xiaotian Hao
  • Tong Sang
  • Yan Zheng 0002
  • Jianye Hao
  • Matthew E. Taylor

Learning to collaborate is critical in Multi-Agent Reinforcement Learning (MARL). Previous works promote collaboration by maximizing the correlation of agents’ behaviors, which is typically characterized by Mutual Information (MI) in different forms. However, we reveal sub-optimal collaborative behaviors also emerge with strong correlations, and simply maximizing the MI can, surprisingly, hinder the learning towards better collaboration. To address this issue, we propose a novel MARL framework, called Progressive Mutual Information Collaboration (PMIC), for more effective MI-driven collaboration. PMIC uses a new collaboration criterion measured by the MI between global states and joint actions. Based on this criterion, the key idea of PMIC is maximizing the MI associated with superior collaborative behaviors and minimizing the MI associated with inferior ones. The two MI objectives play complementary roles by facilitating better collaborations while avoiding falling into sub-optimal ones. Experiments on a wide range of MARL benchmarks show the superior performance of PMIC compared with other algorithms.

AAMAS Conference 2021 Conference Paper

Diverse Auto-Curriculum is Critical for Successful Real-World Multiagent Learning Systems

  • Yaodong Yang
  • Jun Luo
  • Ying Wen
  • Oliver Slumbers
  • Daniel Graves
  • Haitham Bou Ammar
  • Jun Wang
  • Matthew E. Taylor

Multiagent reinforcement learning (MARL) has achieved a remarkable amount of success in solving various types of video games. A cornerstone of this success is the auto-curriculum framework, which shapes the learning process by continually creating new challenging tasks for agents to adapt to, thereby facilitating the acquisition of new skills. In order to extend MARL methods to real-world domains outside of video games, we envision in this blue sky paper that maintaining a diversity-aware auto-curriculum is critical for successful MARL applications. Specifically, we argue that behavioural diversity is a pivotal, yet under-explored, component for real-world multiagent learning systems, and that significant work remains in understanding how to design a diversity-aware auto-curriculum. We list four open challenges for auto-curriculum techniques, which we believe deserve more attention from this community. Towards validating our vision, we recommend modelling realistic interactive behaviours in autonomous driving as an important test bed, and recommend the SMARTS/ULTRA benchmark.

AAMAS Conference 2021 Conference Paper

Partially Observable Mean Field Reinforcement Learning

  • Sriram Ganapathi Subramanian
  • Matthew E. Taylor
  • Mark Crowley
  • Pascal Poupart

Traditional multi-agent reinforcement learning algorithms are not scalable to environments with more than a few agents, since these algorithms are exponential in the number of agents. Recent research has introduced successful methods to scale multi-agent reinforcement learning algorithms to many agent scenarios using mean field theory. Previous work in this field assumes that an agent has access to exact cumulative metrics regarding the mean field behaviour of the system, which it can then use to take its actions. In this paper, we relax this assumption and maintain a distribution to model the uncertainty regarding the mean field of the system. We consider two different settings for this problem. In the first setting, only agents in a fixed neighbourhood are visible, while in the second setting, the visibility of agents is determined at random based on distances. For each of these settings, we introduce a Q-learning based algorithm that can learn effectively. We prove that this Q-learning estimate stays very close to the Nash Q-value (under a common set of assumptions) for the first setting. We also empirically show our algorithms outperform multiple baselines in three different games in the MAgents framework, which supports large environments with many agents learning simultaneously to achieve possibly distinct goals.

AAAI Conference 2021 Conference Paper

Towered Actor Critic For Handling Multiple Action Types In Reinforcement Learning For Drug Discovery

  • Sai Krishna Gottipati
  • Yashaswi Pathak
  • Boris Sattarov
  • Sahir
  • Rohan Nuttall
  • Mohammad Amini
  • Matthew E. Taylor
  • Sarath Chandar

Reinforcement learning (RL) has made significant progress in both abstract and real-world domains, but the majority of state-of-the-art algorithms deal only with monotonic actions. However, some applications require agents to reason over different types of actions. Our application simulates reaction-based molecule generation, used as part of the drug discovery pipeline, and includes both uni-molecular and bi-molecular reactions. This paper introduces a novel framework, towered actor critic (TAC), to handle multiple action types. The TAC framework is general in that it is designed to be combined with any existing RL algorithms for continuous action space. We combine it with TD3 to empirically obtain significantly better results than existing methods in the drug discovery setting. TAC is also applied to RL benchmarks in OpenAI Gym and results show that our framework can improve, or at least does not hurt, performance relative to standard TD3.

JMLR Journal 2020 Journal Article

Curriculum Learning for Reinforcement Learning Domains: A Framework and Survey

  • Sanmit Narvekar
  • Bei Peng
  • Matteo Leonetti
  • Jivko Sinapov
  • Matthew E. Taylor
  • Peter Stone

Reinforcement learning (RL) is a popular paradigm for addressing sequential decision tasks in which the agent has only limited environmental feedback. Despite many advances over the past three decades, learning in many domains still requires a large amount of interaction with the environment, which can be prohibitively expensive in realistic scenarios. To address this problem, transfer learning has been applied to reinforcement learning such that experience gained in one task can be leveraged when starting to learn the next, harder task. More recently, several lines of research have explored how tasks, or data samples themselves, can be sequenced into a curriculum for the purpose of learning a problem that may otherwise be too difficult to learn from scratch. In this article, we present a framework for curriculum learning (CL) in reinforcement learning, and use it to survey and classify existing CL methods in terms of their assumptions, capabilities, and goals. Finally, we use our framework to find open problems and suggest directions for future RL curriculum learning research.

AAAI Conference 2020 Short Paper

Providing Uncertainty-Based Advice for Deep Reinforcement Learning Agents (Student Abstract)

  • Felipe Leno Da Silva
  • Pablo Hernandez-Leal
  • Bilal Kartal
  • Matthew E. Taylor

The sample-complexity of Reinforcement Learning (RL) techniques still represents a challenge for scaling up RL to unsolved domains. One way to alleviate this problem is to leverage samples from the policy of a demonstrator to learn faster. However, advice is normally limited, hence advice should ideally be directed to states where the agent is uncertain on the best action to be applied. In this work, we propose Requesting Confidence-Moderated Policy advice (RCMP), an action-advising framework where the agent asks for advice when its uncertainty is high. We describe a technique to estimate the agent uncertainty with minor modifications in standard value-based RL methods. RCMP is shown to perform better than several baselines in the Atari Pong domain.

AAAI Conference 2020 Conference Paper

Uncertainty-Aware Action Advising for Deep Reinforcement Learning Agents

  • Felipe Leno Da Silva
  • Pablo Hernandez-Leal
  • Bilal Kartal
  • Matthew E. Taylor

Although Reinforcement Learning (RL) has been one of the most successful approaches for learning in sequential decision making problems, the sample-complexity of RL techniques still represents a major challenge for practical applications. To combat this challenge, whenever a competent policy (e.g., either a legacy system or a human demonstrator) is available, the agent could leverage samples from this policy (advice) to improve sample-efficiency. However, advice is normally limited, hence it should ideally be directed to states where the agent is uncertain on the best action to execute. In this work, we propose Requesting Confidence-Moderated Policy advice (RCMP), an action-advising framework where the agent asks for advice when its epistemic uncertainty is high for a certain state. RCMP takes into account that the advice is limited and might be suboptimal. We also describe a technique to estimate the agent uncertainty by performing minor modifications in standard value-function-based RL methods. Our empirical evaluations show that RCMP performs better than Importance Advising, not receiving advice, and receiving it at random states in Gridworld and Atari Pong scenarios.
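
A minimal sketch of the uncertainty gate, assuming an ensemble-of-heads estimator (the abstract only states that uncertainty is obtained through minor modifications of value-based methods, so the architecture and threshold below are illustrative assumptions):

```python
import torch
import torch.nn as nn

class MultiHeadQ(nn.Module):
    """Several Q-heads over a shared torso; their disagreement proxies epistemic uncertainty."""
    def __init__(self, obs_dim, n_actions, n_heads=5):
        super().__init__()
        self.torso = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU())
        self.heads = nn.ModuleList(nn.Linear(64, n_actions) for _ in range(n_heads))

    def forward(self, obs):
        h = self.torso(obs)
        return torch.stack([head(h) for head in self.heads])   # (n_heads, n_actions)

def act(net, obs, teacher_policy=None, threshold=0.5):
    q_heads = net(obs)
    uncertainty = q_heads.var(dim=0).mean()       # variance across heads at this state
    if teacher_policy is not None and uncertainty > threshold:
        return teacher_policy(obs)                # spend a piece of the limited advice budget
    return int(q_heads.mean(dim=0).argmax())      # otherwise act greedily on the mean estimate
```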

JAAMAS Journal 2019 Journal Article

A survey and critique of multiagent deep reinforcement learning

  • Pablo Hernandez-Leal
  • Bilal Kartal
  • Matthew E. Taylor

Abstract Deep reinforcement learning (RL) has achieved outstanding results in recent years. This has led to a dramatic increase in the number of applications and methods. Recent works have explored learning beyond single-agent scenarios and have considered multiagent learning (MAL) scenarios. Initial results report successes in complex multiagent domains, although there are several challenges to be addressed. The primary goal of this article is to provide a clear overview of current multiagent deep reinforcement learning (MDRL) literature. Additionally, we complement the overview with a broader analysis: (i) we revisit previous key components, originally presented in MAL and RL, and highlight how they have been adapted to multiagent deep reinforcement learning settings. (ii) We provide general guidelines to new practitioners in the area: describing lessons learned from MDRL works, pointing to recent benchmarks, and outlining open avenues of research. (iii) We take a more critical tone raising practical challenges of MDRL (e. g. , implementation and computational demands). We expect this article will help unify and motivate future research to take advantage of the abundant literature that exists (e. g. , RL and MAL) in a joint effort to promote fruitful research in the multiagent community.

IJCAI Conference 2019 Conference Paper

Interactive Reinforcement Learning with Dynamic Reuse of Prior Knowledge from Human and Agent Demonstrations

  • Zhaodong Wang
  • Matthew E. Taylor

Reinforcement learning has enjoyed multiple impressive successes in recent years. However, these successes typically require very large amounts of data before an agent achieves acceptable performance. This paper focuses on a novel way of combating such requirements by leveraging existing (human or agent) knowledge. In particular, this paper leverages demonstrations, allowing an agent to quickly achieve high performance. This paper introduces the Dynamic Reuse of Prior (DRoP) algorithm, which combines the offline knowledge (demonstrations recorded before learning) with online confidence-based performance analysis. DRoP leverages the demonstrator's knowledge by automatically balancing between reusing the prior knowledge and the current learned policy, allowing the agent to outperform the original demonstrations. We compare with multiple state-of-the-art learning algorithms and empirically show that DRoP can achieve superior performance in two domains. Additionally, we show that this confidence measure can be used to selectively request additional demonstrations, significantly improving the learning performance of the agent.
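
A rough sketch of confidence-based arbitration between a demonstration-derived prior and the agent's own learned policy; the specific confidence update below is an assumption for illustration, not DRoP's published rule:

```python
from collections import defaultdict

conf_prior = defaultdict(float)   # running confidence in the demonstrator's prior, per state
conf_self = defaultdict(float)    # running confidence in the agent's learned policy, per state
beta = 0.1                        # confidence step size (assumed)

def choose_action(state, prior_action, learned_action):
    """Reuse the demonstrator where it has proven more reliable than the learner."""
    return prior_action if conf_prior[state] >= conf_self[state] else learned_action

def update_confidence(state, used_prior, td_error):
    score = -abs(td_error)        # a source that predicts returns well earns confidence
    if used_prior:
        conf_prior[state] += beta * (score - conf_prior[state])
    else:
        conf_self[state] += beta * (score - conf_self[state])
```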

IJCAI Conference 2019 Conference Paper

Metatrace Actor-Critic: Online Step-Size Tuning by Meta-gradient Descent for Reinforcement Learning Control

  • Kenny Young
  • Baoxiang Wang
  • Matthew E. Taylor

Reinforcement learning (RL) has had many successes, but significant hyperparameter tuning is commonly required to achieve good performance. Furthermore, when nonlinear function approximation is used, non-stationarity in the state representation can lead to learning instability. A variety of techniques exist to combat this --- most notably experience replay or the use of parallel actors. These techniques stabilize learning by making the RL problem more similar to the supervised setting. However, they come at the cost of moving away from the RL problem as it is typically formulated, that is, a single agent learning online without maintaining a large database of training examples. To address these issues, we propose Metatrace, a meta-gradient descent based algorithm to tune the step-size online. Metatrace leverages the structure of eligibility traces, and works both for tuning a single scalar step-size and for tuning a separate step-size for each parameter. We empirically evaluate Metatrace for actor-critic on the Arcade Learning Environment. Results show Metatrace can speed up learning, and improve performance in non-stationary settings.
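
The flavour of online meta-gradient step-size tuning can be seen in an IDBD-style update for a linear predictor; Metatrace itself extends this kind of rule to actor-critic with eligibility traces, which the sketch below does not attempt to reproduce:

```python
import numpy as np

n_features = 8
w = np.zeros(n_features)                      # predictor weights
log_alpha = np.full(n_features, np.log(0.05)) # per-feature log step-sizes, tuned online
h = np.zeros(n_features)                      # trace of recent weight updates
meta_lr = 0.01

def idbd_update(x, target):
    """One online example: meta-gradient step on the step-sizes, then the usual update."""
    global w
    error = target - w @ x
    log_alpha[:] += meta_lr * error * x * h   # move step-sizes to reduce future error
    alpha = np.exp(log_alpha)
    w = w + alpha * error * x
    h[:] = h * np.clip(1.0 - alpha * x * x, 0.0, None) + alpha * error * x
    return error
```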

KER Journal 2019 Journal Article

Pre-training with non-expert human demonstration for deep reinforcement learning

  • Gabriel V. de la Cruz
  • Yunshu Du
  • Matthew E. Taylor

Deep reinforcement learning (deep RL) has achieved superior performance in complex sequential tasks by using deep neural networks as function approximators to learn directly from raw input images. However, learning directly from raw images is data inefficient. The agent must learn feature representation of complex states in addition to learning a policy. As a result, deep RL typically suffers from slow learning speeds and often requires a prohibitively large amount of training time and data to reach reasonable performance, making it inapplicable to real-world settings where data are expensive. In this work, we improve data efficiency in deep RL by addressing one of the two learning goals, feature learning. We leverage supervised learning to pre-train on a small set of non-expert human demonstrations and empirically evaluate our approach using the asynchronous advantage actor-critic algorithms in the Atari domain. Our results show significant improvements in learning speed, even when the provided demonstration is noisy and of low quality.
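
The pre-training step can be pictured as ordinary supervised learning over (frame, demonstrator action) pairs, with the learned torso then reused when RL begins. Shapes, layer sizes, and the demo_loader below are assumptions for illustration (84x84 stacked frames), not the authors' code:

```python
import torch
import torch.nn as nn

class Torso(nn.Module):
    def __init__(self, n_actions):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 16, 8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, 4, stride=2), nn.ReLU(), nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256), nn.ReLU())
        self.action_head = nn.Linear(256, n_actions)   # used only during pre-training

    def forward(self, frames):                         # frames: (batch, 4, 84, 84)
        return self.action_head(self.features(frames))

def pretrain(net, demo_loader, epochs=3):
    """Classify the demonstrator's actions; the features, not the head, are what RL reuses."""
    opt = torch.optim.Adam(net.parameters(), lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for frames, demo_actions in demo_loader:       # small, possibly noisy demonstration set
            opt.zero_grad()
            loss_fn(net(frames), demo_actions).backward()
            opt.step()
```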

IJCAI Conference 2018 Conference Paper

Autonomously Reusing Knowledge in Multiagent Reinforcement Learning

  • Felipe Leno Da Silva
  • Matthew E. Taylor
  • Anna Helena Reali Costa

Autonomous agents are increasingly required to solve complex tasks; hard-coding behaviors has become infeasible. Hence, agents must learn how to solve tasks via interactions with the environment. In many cases, knowledge reuse will be a core technology to keep training times reasonable, and for that, agents must be able to autonomously and consistently reuse knowledge from multiple sources, including both their own previous internal knowledge and knowledge from other agents. In this paper, we provide a literature review of methods for knowledge reuse in Multiagent Reinforcement Learning. We define an important challenge problem for the AI community, survey the existing methods, and discuss how they can all contribute to this challenging problem. Moreover, we highlight gaps in the current literature, motivating "low-hanging fruit" for those interested in the area. Our ambition is that this paper will encourage the community to work on this difficult and relevant research challenge.

IJCAI Conference 2018 Conference Paper

Improving Reinforcement Learning with Human Input

  • Matthew E. Taylor

Reinforcement learning (RL) has had many successes when learning autonomously. This paper and accompanying talk consider how to make use of a non-technical human participant, when available. In particular, we consider the case where a human could 1) provide demonstrations of good behavior, 2) provide online evaluative feedback, or 3) define a curriculum of tasks for the agent to learn on. In all cases, our work has shown such information can be effectively leveraged. After giving a high-level overview of this work, we will highlight a set of open questions and suggest where future work could be usefully focused.

KER Journal 2018 Journal Article

Leveraging human knowledge in tabular reinforcement learning: a study of human subjects

  • Ariel Rosenfeld
  • Moshe Cohen
  • Matthew E. Taylor
  • Sarit Kraus

Reinforcement learning (RL) can be extremely effective in solving complex, real-world problems. However, injecting human knowledge into an RL agent may require extensive effort and expertise on the human designer’s part. To date, human factors are generally not considered in the development and evaluation of possible RL approaches. In this article, we set out to investigate how different methods for injecting human knowledge are applied, in practice, by human designers of varying levels of knowledge and skill. We perform the first empirical evaluation of several methods, including a newly proposed method named State Action Similarity Solutions (SASS) which is based on the notion of similarities in the agent’s state–action space. Through this human study, consisting of 51 human participants, we shed new light on the human factors that play a key role in RL. We find that the classical reward shaping technique seems to be the most natural method for most designers, both expert and non-expert, to speed up RL. However, we further find that our proposed method SASS can be effectively and efficiently combined with reward shaping, and provides a beneficial alternative to using only a single speedup method, with minimal additional effort from the human designer.

AAMAS Conference 2017 Conference Paper

An Exploration Strategy Facing Non-Stationary Agents (JAAMAS Extended Abstract)

  • Pablo Hernandez-Leal
  • Yusen Zhan
  • Matthew E. Taylor
  • L. Enrique Sucar
  • Enrique Munoz de Cote

The success or failure of any learning algorithm is partially due to the exploration strategy it employs. However, most exploration strategies assume that the environment is stationary and non-strategic. This work investigates how to design exploration strategies in non-stationary and adversarial environments. Our experimental setting uses a two-agent strategic interaction scenario, where the opponent switches between different behavioral patterns. The agent’s objective is to learn a model of the opponent’s strategy to act optimally, despite non-determinism and stochasticity. Our contribution is twofold. First, we present drift exploration as a strategy for switch detection. Second, we propose a new algorithm called R-max# that reasons and acts in terms of two objectives: 1) to maximize utilities in the short term while learning and 2) to eventually explore, implicitly looking for changes in the opponent’s behavior. We provide theoretical results showing that R-max# is guaranteed to detect the opponent’s switch and learn a new model in terms of finite sample complexity.

AAMAS Conference 2017 Conference Paper

Curriculum Design for Machine Learners in Sequential Decision Tasks

  • Bei Peng
  • James MacGlashan
  • Robert Loftin
  • Michael L. Littman
  • David L. Roberts
  • Matthew E. Taylor

Existing machine-learning work has shown that algorithms can benefit from curricula—learning first on simple examples before moving to more difficult examples. While most existing work on curriculum learning focuses on developing automatic methods to iteratively select training examples with increasing difficulty tailored to the current ability of the learner, relatively little attention has been paid to the ways in which humans design curricula. We argue that a better understanding of human-designed curricula could give us insights into the development of new machine-learning algorithms and interfaces that can better accommodate machine- or human-created curricula. Our work addresses this emerging and vital area empirically, taking an important step to characterize the nature of human-designed curricula relative to the space of possible curricula and the performance benefits that may (or may not) occur.

AAMAS Conference 2017 Conference Paper

Detecting Switches Against Non-Stationary Opponents (JAAMAS Extended Abstract)

  • Pablo Hernandez-Leal
  • Yusen Zhan
  • Matthew E. Taylor
  • L. Enrique Sucar
  • Enrique Munoz de Cote

Interactions in multiagent systems are generally more complicated than single agent ones. Game theory provides solutions on how to act in multiple agent scenarios; however, it assumes that all agents will act rationally. Moreover, some works also assume the opponent will use a stationary strategy. These assumptions usually do not hold in real world scenarios where agents have limited capacities and may deviate from a perfect rational response. Our goal is still to act optimally in these cases by learning the appropriate response and without any prior policies on how to act. Thus, we focus on the problem where another agent in the environment uses different stationary strategies over time. This paper introduces DriftER, an algorithm that 1) learns a model of the opponent, 2) uses that model to obtain an optimal policy and then 3) determines when it must re-learn due to an opponent strategy change. We provide theoretical results showing that DriftER is guaranteed to detect switches with high probability. Also, we provide empirical results in normal form games and then in a more realistic scenario, the Power TAC simulator.

IJCAI Conference 2017 Conference Paper

Improving Reinforcement Learning with Confidence-Based Demonstrations

  • Zhaodong Wang
  • Matthew E. Taylor

Reinforcement learning has had many successes, but in practice it often requires significant amounts of data to learn high-performing policies. One common way to improve learning is to allow a trained (source) agent to assist a new (target) agent. The goals in this setting are to 1) improve the target agent's performance, relative to learning unaided, and 2) allow the target agent to outperform the source agent. Our approach leverages source agent demonstrations, removing any requirements on the source agent's learning algorithm or representation. The target agent then estimates the source agent's policy and improves upon it. The key contribution of this work is to show that leveraging the target agent's uncertainty in the source agent's policy can significantly improve learning in two complex simulated domains, Keepaway and Mario.

ICML Conference 2017 Conference Paper

Interactive Learning from Policy-Dependent Human Feedback

  • James MacGlashan
  • Mark K. Ho
  • Robert Tyler Loftin
  • Bei Peng 0001
  • Guan Wang
  • David L. Roberts 0001
  • Matthew E. Taylor
  • Michael L. Littman

This paper investigates the problem of interactively learning behaviors communicated by a human teacher using positive and negative feedback. Much previous work on this problem has made the assumption that people provide feedback for decisions that is dependent on the behavior they are teaching and is independent from the learner’s current policy. We present empirical results that show this assumption to be false—whether human trainers give positive or negative feedback for a decision is influenced by the learner’s current policy. Based on this insight, we introduce Convergent Actor-Critic by Humans (COACH), an algorithm for learning from policy-dependent feedback that converges to a local optimum. Finally, we demonstrate that COACH can successfully learn multiple behaviors on a physical robot.
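
The published COACH update treats the trainer's feedback as an advantage-like signal scaling the policy gradient; the snippet below is a minimal single-step sketch of that idea (network sizes, learning rate, and the omission of eligibility traces are simplifications):

```python
import torch

policy = torch.nn.Sequential(torch.nn.Linear(4, 32), torch.nn.Tanh(), torch.nn.Linear(32, 2))
opt = torch.optim.SGD(policy.parameters(), lr=1e-2)

def coach_update(obs, action, human_feedback):
    """human_feedback: +1 for praise, -1 for punishment from the trainer."""
    logits = policy(torch.as_tensor(obs, dtype=torch.float32))
    log_prob = torch.log_softmax(logits, dim=-1)[action]
    loss = -human_feedback * log_prob     # feedback plays the role of the advantage
    opt.zero_grad()
    loss.backward()
    opt.step()
```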

IJCAI Conference 2017 Conference Paper

Leveraging Human Knowledge in Tabular Reinforcement Learning: A Study of Human Subjects

  • Ariel Rosenfeld
  • Matthew E. Taylor
  • Sarit Kraus

Reinforcement Learning (RL) can be extremely effective in solving complex, real-world problems. However, injecting human knowledge into an RL agent may require extensive effort on the human designer's part. To date, human factors are generally not considered in the development and evaluation of possible approaches. In this paper, we propose and evaluate a novel method, based on human psychology literature, which we show to be both effective and efficient, for both expert and non-expert designers, in injecting human knowledge for speeding up tabular RL.

AAMAS Conference 2016 Conference Paper

A Bayesian Approach for Learning and Tracking Switching, Non-Stationary Opponents (Extended Abstract)

  • Pablo Hernandez-Leal
  • Benjamin Rosman
  • Matthew E. Taylor
  • L. Enrique Sucar
  • Enrique Munoz de Cote

In many situations, agents are required to use a set of strategies (behaviors) and switch among them during the course of an interaction. This work focuses on the problem of recognizing the strategy used by an agent within a small number of interactions. We propose using a Bayesian framework to address this problem. In this paper we extend Bayesian Policy Reuse to adversarial settings where opponents switch from one stationary strategy to another. Our extension enables online learning of new models when the learning agent detects that the current policies are not performing optimally. Experiments presented in repeated games show that our approach yields better performance than state-of-the-art approaches in terms of average rewards.

AAMAS Conference 2016 Conference Paper

A Need for Speed: Adapting Agent Action Speed to Improve Task Learning from Non-Expert Humans

  • Bei Peng
  • James MacGlashan
  • Robert Loftin
  • Michael L. Littman
  • David L. Roberts
  • Matthew E. Taylor

As robots become pervasive in human environments, it is important to enable users to effectively convey new skills without programming. Most existing work on Interactive Reinforcement Learning focuses on interpreting and incorporating non-expert human feedback to speed up learning; we aim to design a better representation of the learning agent that is able to elicit more natural and effective communication between the human trainer and the learner, while treating human feedback as discrete communication that depends probabilistically on the trainer’s target policy. This work entails a user study where participants train a virtual agent to accomplish tasks by giving reward and/or punishment in a variety of simulated environments. We present results from 60 participants to show how a learner can ground natural language commands and adapt its action execution speed to learn more efficiently from human trainers. The agent’s action execution speed can be successfully modulated to encourage more explicit feedback from a human trainer in areas of the state space where there is high uncertainty. Our results show that our novel adaptive speed agent dominates different fixed speed agents on several measures of performance. Additionally, we investigate the impact of instructions on user performance and user preference in training conditions.

AAMAS Conference 2016 Conference Paper

Learning from Demonstration for Shaping through Inverse Reinforcement Learning

  • Halit Bener Suay
  • Tim Brys
  • Matthew E. Taylor
  • Sonia Chernova

Model-free episodic reinforcement learning problems define the environment reward with functions that often provide only sparse information throughout the task. Consequently, agents are not given enough feedback about the fitness of their actions until the task ends with success or failure. Previous work addresses this problem with reward shaping. In this paper we introduce a novel approach to improve model-free reinforcement learning agents’ performance with a three-step approach. Specifically, we collect demonstration data, use the data to recover a linear function using inverse reinforcement learning, and use the recovered function for potential-based reward shaping. Our approach is model-free and scalable to high dimensional domains. To show the scalability of our approach we present two sets of experiments in a two-dimensional Maze domain and the 27-dimensional Mario AI domain. We compare the performance of our algorithm to previously introduced reinforcement learning from demonstration algorithms. Our experiments show that our approach outperforms the state-of-the-art in cumulative reward, learning rate and asymptotic performance.
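
Once inverse reinforcement learning has produced a linear function over state features, the shaping step itself is small; the sketch below assumes the recovered weights are already available (the placeholder values and feature dimension are illustrative):

```python
import numpy as np

gamma = 0.99
w = np.array([0.5, -0.2, 1.0])        # weights recovered by IRL (placeholder values)

def potential(phi_s):
    return float(w @ phi_s)

def shaped_reward(env_reward, phi_s, phi_s_next):
    # Potential-based shaping F(s, s') = gamma * Phi(s') - Phi(s) leaves optimal policies unchanged.
    return env_reward + gamma * potential(phi_s_next) - potential(phi_s)
```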

IROS Conference 2016 Conference Paper

Lifelong learning for disturbance rejection on mobile robots

  • David Isele
  • José-Marcio Luna
  • Eric Eaton
  • Gabriel Victor de la Cruz
  • James Irwin
  • Brandon Kallaher
  • Matthew E. Taylor

No two robots are exactly the same—even for a given model of robot, different units will require slightly different controllers. Furthermore, because robots change and degrade over time, a controller will need to change over time to remain optimal. This paper leverages lifelong learning in order to learn controllers for different robots. In particular, we show that by learning a set of control policies over robots with different (unknown) motion models, we can quickly adapt to changes in the robot, or learn a controller for a new robot with a unique set of disturbances. Furthermore, the approach is completely model-free, allowing us to apply this method to robots that have not been, or cannot be, fully modeled.

IJCAI Conference 2016 Conference Paper

Theoretically-Grounded Policy Advice from Multiple Teachers in Reinforcement Learning Settings with Applications to Negative Transfer

  • Yusen Zhan
  • Haitham Bou Ammar
  • Matthew E. Taylor

Policy advice is a transfer learning method where a student agent is able to learn faster via advice from a teacher. However, both this and other reinforcement learning transfer methods have little theoretical analysis. This paper formally defines a setting where multiple teacher agents can provide advice to a student and introduces an algorithm to leverage both autonomous exploration and teachers' advice. Our regret bounds justify the intuition that good teachers help while bad teachers hurt. Using our formalization, we are also able to quantify, for the first time, when negative transfer can occur within such a reinforcement learning setting.

IJCAI Conference 2015 Conference Paper

Reinforcement Learning from Demonstration through Shaping

  • Tim Brys
  • Anna Harutyunyan
  • Halit Bener Suay
  • Sonia Chernova
  • Matthew E. Taylor
  • Ann Nowé

Reinforcement learning describes how a learning agent can achieve optimal behaviour based on interactions with its environment and reward feedback. A limiting factor in reinforcement learning as employed in artificial intelligence is the need for an often prohibitively large number of environment samples before the agent reaches a desirable level of performance. Learning from demonstration is an approach that provides the agent with demonstrations by a supposed expert, from which it should derive suitable behaviour. Yet, one of the challenges of learning from demonstration is that no guarantees can be provided for the quality of the demonstrations, and thus the learned behavior. In this paper, we investigate the intersection of these two approaches, leveraging the theoretical guarantees provided by reinforcement learning, and using expert demonstrations to speed up this learning by biasing exploration through a process called reward shaping. This approach allows us to leverage human input without making an erroneous assumption regarding demonstration optimality. We show experimentally that this approach requires significantly fewer demonstrations, is more robust against suboptimality of demonstrations, and achieves much faster learning than the recently developed HAT algorithm.
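
One concrete way to bias exploration with demonstrations, sketched under assumed details: the potential of a state-action pair is its similarity to the nearest demonstrated state in which the expert chose the same action, and that potential feeds a standard shaping term. The demonstration contents and kernel width below are placeholders:

```python
import numpy as np

gamma, sigma = 0.99, 0.5
demos = [(np.array([0.1, 0.2]), 1), (np.array([0.8, 0.4]), 0)]   # (state, action) pairs

def potential(state, action):
    """Gaussian similarity to the closest demonstrated state with a matching action."""
    similar = [np.exp(-np.sum((state - s_d) ** 2) / (2 * sigma ** 2))
               for s_d, a_d in demos if a_d == action]
    return max(similar, default=0.0)

def shaped_reward(r, s, a, s_next, a_next):
    # Look-ahead advice over state-action pairs: F = gamma * Phi(s', a') - Phi(s, a).
    return r + gamma * potential(s_next, a_next) - potential(s, a)
```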

ICML Conference 2014 Conference Paper

Online Multi-Task Learning for Policy Gradient Methods

  • Haitham Bou-Ammar
  • Eric Eaton
  • Paul Ruvolo
  • Matthew E. Taylor

Policy gradient algorithms have shown considerable recent success in solving high-dimensional sequential decision making tasks, particularly in robotics. However, these methods often require extensive experience in a domain to achieve high performance. To make agents more sample-efficient, we developed a multi-task policy gradient method to learn decision making tasks consecutively, transferring knowledge between tasks to accelerate learning. Our approach provides robust theoretical guarantees, and we show empirically that it dramatically accelerates learning on a variety of dynamical systems, including an application to quadrotor control.

ECAI Conference 2014 Conference Paper

Using Ensemble Techniques and Multi-Objectivization to Solve Reinforcement Learning Problems

  • Tim Brys
  • Matthew E. Taylor
  • Ann Nowé

Recent work on multi-objectivization has shown how a single-objective reinforcement learning problem can be turned into a multi-objective problem with correlated objectives, by providing multiple reward shaping functions. The information contained in these correlated objectives can be exploited to solve the base, single-objective problem faster and better, given techniques specifically aimed at handling such correlated objectives. In this paper, we identify ensemble techniques as a set of methods that is suitable to solve multi-objectivized reinforcement learning problems. We empirically demonstrate their use on the Pursuit domain.
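
A small sketch of the combination, with assumed components: one tabular learner per shaping function, each trained on the base reward plus its own correlated objective, and a simple vote at action-selection time (the shaping terms and sizes are placeholders):

```python
import numpy as np

n_states, n_actions, gamma, lr = 20, 4, 0.95, 0.1
shapings = [lambda s, s2: 0.0, lambda s, s2: 0.01 * (s2 - s)]    # placeholder shaping signals
Qs = [np.zeros((n_states, n_actions)) for _ in shapings]

def ensemble_action(s):
    votes = np.zeros(n_actions)
    for Q in Qs:
        votes[int(Q[s].argmax())] += 1        # each learner votes for its greedy action
    return int(votes.argmax())

def ensemble_update(s, a, r, s_next):
    for Q, shaping in zip(Qs, shapings):
        target = r + shaping(s, s_next) + gamma * Q[s_next].max()
        Q[s, a] += lr * (target - Q[s, a])
```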

AAMAS Conference 2011 Conference Paper

ESCAPES - Evacuation Simulation with Children, Authorities, Parents, Emotions, and Social comparison

  • Jason Tsai
  • Natalie Fridman
  • Emma Bowring
  • Matthew Brown
  • Shira Epstein
  • Gal A. Kaminka
  • Stacy Marsella
  • Andrew Ogden

In creating an evacuation simulation for training and planning, realistic agents that reproduce known phenomena are required. Evacuation simulation in the airport domain requires additional features beyond most simulations, including the unique behaviors of first-time visitors who have incomplete knowledge of the area and families that do not necessarily adhere to often-assumed pedestrian behaviors. Evacuation simulations not customized for the airport domain do not incorporate the factors important to it, leading to inaccuracies when applied to it. In this paper, we describe ESCAPES, a multiagent evacuation simulation tool that incorporates four key features: (i) different agent types; (ii) emotional interactions; (iii) informational interactions; (iv) behavioral interactions. Our simulator reproduces phenomena observed in existing studies on evacuation scenarios, and the features we incorporate substantially impact escape time. We use ESCAPES to model the International Terminal at Los Angeles International Airport (LAX) and receive high praise from security officials.

AAMAS Conference 2011 Conference Paper

Integrating Reinforcement Learning with Human Demonstrations of Varying Ability

  • Matthew E. Taylor
  • Halit Bener Suay
  • Sonia Chernova

This work introduces Human-Agent Transfer (HAT), an algorithm that combines transfer learning, learning from demonstration and reinforcement learning to achieve rapid learning and high performance in complex domains. Using experiments in a simulated robot soccer domain, we show that human demonstrations transferred into a baseline policy for an agent and refined using reinforcement learning significantly improve both learning time and policy performance. Our evaluation compares three algorithmic approaches to incorporating demonstration rule summaries into transfer learning, and studies the impact of demonstration quality and quantity, as well as the effect of combining demonstrations from multiple teachers. Our results show that all three transfer methods lead to statistically significant improvement in performance over learning without demonstration. The best performance was achieved by combining the best demonstrations from two teachers.

AAMAS Conference 2011 Conference Paper

Metric Learning for Reinforcement Learning Agents

  • Matthew E. Taylor
  • Brian Kulis
  • Fei Sha

A key component of any reinforcement learning algorithm is the underlying representation used by the agent. While reinforcement learning (RL) agents have typically relied on hand-coded state representations, there has been a growing interest in learning this representation. While inputs to an agent are typically fixed (i.e., state variables represent sensors on a robot), it is desirable to automatically determine the optimal relative scaling of such inputs, as well as to diminish the impact of irrelevant features. This work introduces HOLLER, a novel distance metric learning algorithm, and combines it with an existing instance-based RL algorithm to achieve precisely these goals. The algorithms' success is highlighted via empirical measurements on a set of six tasks within the mountain car domain.

EUMAS Conference 2011 Conference Paper

Reinforcement Learning Transfer Using a Sparse Coded Inter-task Mapping

  • Haitham Bou-Ammar
  • Matthew E. Taylor
  • Karl Tuyls
  • Gerhard Weiß 0001

Reinforcement learning agents can successfully learn in a variety of difficult tasks. A fundamental problem is that they may learn slowly in complex environments, inspiring the development of speedup methods such as transfer learning. Transfer improves learning by reusing learned behaviors in similar tasks, usually via an inter-task mapping, which defines how a pair of tasks are related. This paper proposes a novel transfer learning technique to autonomously construct an inter-task mapping by using a novel combination of sparse coding, sparse projection learning, and sparse pseudo-input Gaussian processes. Experiments show successful transfer of information between two very different domains: the mountain car and the pole swing-up task. This paper empirically shows that the learned inter-task mapping can be used to successfully (1) improve the performance of a learned policy on a fixed number of samples, (2) reduce the learning times needed by the algorithms to converge to a policy on a fixed number of samples, and (3) converge faster to a near-optimal policy given a large amount of samples.

AAMAS Conference 2011 Conference Paper

Teamwork in Distributed POMDPs: Execution-time Coordination Under Model Uncertainty

  • Jun-young Kwak
  • Rong Yang
  • Zhengyu Yin
  • Matthew E. Taylor
  • Milind Tambe

Despite their worst-case NEXP-complete planning complexity, DEC-POMDPs remain a popular framework for multiagent teamwork. This paper introduces effective teamwork under model uncertainty (i.e., potentially inaccurate transition and observation functions) as a novel challenge for DEC-POMDPs and presents MODERN, the first execution-centric framework for DEC-POMDPs explicitly motivated by addressing such model uncertainty. MODERN's shift of coordination reasoning from planning-time to execution-time avoids the high cost of computing optimal plans whose promised quality may not be realized in practice. There are three key ideas in MODERN: (i) it maintains an exponentially smaller model of other agents' beliefs and actions than in previous work and then further reduces the computation-time and space expense of this model via bounded pruning; (ii) it reduces execution-time computation by exploiting BDI theories of teamwork, and limits communication to key trigger points; and (iii) it limits its decision-theoretic reasoning about communication to trigger points and uses a systematic markup to encourage extra communication at these points - thus reducing uncertainty among team members at trigger points.

EWRL Workshop 2011 Conference Paper

Transfer Learning via Multiple Inter-task Mappings

  • Anestis Fachantidis
  • Ioannis Partalas
  • Matthew E. Taylor
  • Ioannis P. Vlahavas

In this paper we investigate using multiple mappings for transfer learning in reinforcement learning tasks. We propose two different transfer learning algorithms that are able to manipulate multiple inter-task mappings for both model-learning and model-free reinforcement learning algorithms. Both algorithms incorporate mechanisms to select the appropriate mappings, helping to avoid the phenomenon of negative transfer. The proposed algorithms are evaluated in the Mountain Car and Keepaway domains. Experimental results show that the use of multiple inter-task mappings can significantly boost the performance of transfer learning methodologies, relative to using a single mapping or learning without transfer.

JAAMAS Journal 2009 Journal Article

Critical factors in the empirical performance of temporal difference and evolutionary methods for reinforcement learning

  • Shimon Whiteson
  • Matthew E. Taylor
  • Peter Stone

Temporal difference and evolutionary methods are two of the most common approaches to solving reinforcement learning problems. However, there is little consensus on their relative merits and there have been few empirical studies that directly compare their performance. This article aims to address this shortcoming by presenting results of empirical comparisons between Sarsa and NEAT, two representative methods, in mountain car and keepaway, two benchmark reinforcement learning tasks. In each task, the methods are evaluated in combination with both linear and nonlinear representations to determine their best configurations. In addition, this article tests two specific hypotheses about the critical factors contributing to these methods’ relative performance: (1) that sensor noise reduces the final performance of Sarsa more than that of NEAT, because Sarsa’s learning updates are not reliable in the absence of the Markov property and (2) that stochasticity, by introducing noise in fitness estimates, reduces the learning speed of NEAT more than that of Sarsa. Experiments in variations of mountain car and keepaway designed to isolate these factors confirm both these hypotheses.

ICAPS Conference 2009 Conference Paper

Exploiting Coordination Locales in Distributed POMDPs via Social Model Shaping

  • Pradeep Varakantham
  • Jun-young Kwak
  • Matthew E. Taylor
  • Janusz Marecki
  • Paul Scerri
  • Milind Tambe

Distributed POMDPs provide an expressive framework for modeling multiagent collaboration problems, but NEXP-Complete complexity hinders their scalability and application in real-world domains. This paper introduces a subclass of distributed POMDPs, and TREMOR, an algorithm to solve such distributed POMDPs. The primary novelty of TREMOR is that agents plan individually with a single agent POMDP solver and use social model shaping to implicitly coordinate with other agents. Experiments demonstrate that TREMOR can provide solutions orders of magnitude faster than existing algorithms while achieving comparable, or even superior, solution quality.

JMLR Journal 2009 Journal Article

Transfer Learning for Reinforcement Learning Domains: A Survey

  • Matthew E. Taylor
  • Peter Stone

The reinforcement learning paradigm is a popular way to address problems that have only limited environmental feedback, rather than correctly labeled examples, as is common in other machine learning contexts. While significant progress has been made to improve learning in a single task, the idea of transfer learning has only recently been applied to reinforcement learning tasks. The core idea of transfer is that experience gained in learning to perform one task can help improve learning performance in a related, but different, task. In this article we present a framework that classifies transfer learning methods in terms of their capabilities and goals, and then use it to survey the existing literature, as well as to suggest future directions for transfer learning work.

AAMAS Conference 2007 Conference Paper

IFSA: Incremental Feature-Set Augmentation for Reinforcement Learning Tasks

  • Mazda Ahmadi
  • Matthew E. Taylor
  • Peter Stone

Reinforcement learning is a popular and successful framework for many agent-related problems because only limited environmental feedback is necessary for learning. While many algorithms exist to learn effective policies in such problems, learning is often used to solve real-world problems, which typically have large state spaces, and therefore suffer from the "curse of dimensionality." One effective method for speeding up reinforcement learning algorithms is to leverage expert knowledge. In this paper, we propose a method for dynamically augmenting the agent's feature set in order to speed up value-function-based reinforcement learning. The domain expert divides the feature set into a series of subsets such that a novel problem concept can be learned from each successive subset. Domain knowledge is also used to order the feature subsets in order of their importance for learning. Our algorithm uses the ordered feature subsets to learn tasks significantly faster than if the entire feature set is used from the start. Incremental Feature-Set Augmentation (IFSA) is fully implemented and tested in three different domains: Gridworld, Blackjack and RoboCup Soccer Keepaway. All experiments show that IFSA can significantly speed up learning, motivating the applicability of this novel RL method.
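
The staged idea can be sketched for a linear value function, under assumed details: an expert-ordered list of feature subsets is unmasked one group at a time, and each stage continues from the weights learned so far:

```python
import numpy as np

feature_subsets = [[0, 1], [2, 3], [4, 5, 6]]   # expert-ordered groups (placeholders)
n_features = 7
w = np.zeros(n_features)                        # value-function weights, carried across stages
mask = np.zeros(n_features)                     # which features are currently active
lr, gamma = 0.05, 0.99

def start_stage(stage):
    for idx in feature_subsets[stage]:
        mask[idx] = 1.0                         # augment the active feature set

def td_update(phi_s, r, phi_s_next):
    v, v_next = w @ (mask * phi_s), w @ (mask * phi_s_next)
    w[:] += lr * (r + gamma * v_next - v) * (mask * phi_s)
```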

AAMAS Conference 2007 Conference Paper

Towards Reinforcement Learning Representation Transfer

  • Matthew E. Taylor
  • Peter Stone

Transfer learning problems are typically framed as leveraging knowledge learned on a source task to improve learning on a related, but different, target task. Current transfer methods are able to successfully transfer knowledge between agents in different reinforcement learning tasks, reducing the time needed to learn the target. However, the complementary task of representation transfer, i.e., transferring knowledge between agents with different internal representations, has not been well explored. The goal in both types of transfer problems is the same: reduce the time needed to learn the target with transfer, relative to learning the target without transfer. This work introduces one such representation transfer algorithm which is implemented in a complex multiagent domain. Experiments demonstrate that transferring the learned knowledge between different representations is both possible and beneficial.

JMLR Journal 2007 Journal Article

Transfer Learning via Inter-Task Mappings for Temporal Difference Learning

  • Matthew E. Taylor
  • Peter Stone
  • Yaxin Liu

Temporal difference (TD) learning (Sutton and Barto, 1998) has become a popular reinforcement learning technique in recent years. TD methods, relying on function approximators to generalize learning to novel situations, have had some experimental successes and have been shown to exhibit some desirable properties in theory, but the most basic algorithms have often been found slow in practice. This empirical result has motivated the development of many methods that speed up reinforcement learning by modifying a task for the learner or helping the learner better generalize to novel situations. This article focuses on generalizing across tasks, thereby speeding up learning, via a novel form of transfer using hand-coded task relationships. We compare learning on a complex task with three function approximators, a cerebellar model arithmetic computer (CMAC), an artificial neural network (ANN), and a radial basis function (RBF), and empirically demonstrate that directly transferring the action-value function can lead to a dramatic speedup in learning with all three. Using transfer via inter-task mapping (TVITM), agents are able to learn one task and then markedly reduce the time it takes to learn a more complex task. Our algorithms are fully implemented and tested in the RoboCup soccer Keepaway domain. This article contains and extends material published in two conference papers (Taylor and Stone, 2005; Taylor et al., 2005).
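
In its simplest (tabular) form, the transfer step amounts to initializing the target task's action-value table through the hand-coded mappings; the article itself works with CMAC, ANN, and RBF approximators, so the sketch below shows only the idea, with placeholder mapping contents:

```python
import numpy as np

def transfer_q(q_source, chi_state, chi_action, n_target_states, n_target_actions):
    """Initialize Q for the target task: chi_state maps target states to source states,
    chi_action maps target actions to source actions."""
    q_target = np.zeros((n_target_states, n_target_actions))
    for s in range(n_target_states):
        for a in range(n_target_actions):
            q_target[s, a] = q_source[chi_state[s], chi_action[a]]
    return q_target
```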

AAMAS Conference 2007 Conference Paper

Transfer via Inter-Task Mappings in Policy Search Reinforcement Learning

  • Matthew E. Taylor
  • Shimon Whiteson
  • Peter Stone

The ambitious goal of transfer learning is to accelerate learning on a target task after training on a different, but related, source task. While many past transfer methods have focused on transferring value-functions, this paper presents a method for transferring policies across tasks with different state and action spaces. In particular, this paper utilizes transfer via inter-task mappings for policy search methods (TVITM-PS) to construct a transfer functional that translates a population of neural network policies trained via policy search from a source task to a target task. Empirical results in robot soccer Keepaway and Server Job Scheduling show that TVITM-PS can markedly reduce learning time when full inter-task mappings are available. The results also demonstrate that TVITM-PS still succeeds when given only incomplete inter-task mappings. Furthermore, we present a novel method for learning such mappings when they are not available, and give results showing they perform comparably to hand-coded mappings.

AAAI Conference 2005 Conference Paper

Value Functions for RL-Based Behavior Transfer: A Comparative Study

  • Matthew E. Taylor

Temporal difference (TD) learning methods (Sutton & Barto 1998) have become popular reinforcement learning techniques in recent years. TD methods, relying on function approximators to generalize learning to novel situations, have had some experimental successes and have been shown to exhibit some desirable properties in theory, but have often been found slow in practice. This paper presents methods for further generalizing across tasks, thereby speeding up learning, via a novel form of behavior transfer. We compare learning on a complex task with three function approximators, a CMAC, a neural network, and an RBF, and demonstrate that behavior transfer works well with all three. Using behavior transfer, agents are able to learn one task and then markedly reduce the time it takes to learn a more complex task. Our algorithms are fully implemented and tested in the RoboCup-soccer keepaway domain.