Arrow Research search

Author name cluster

Tim Brys

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

16 papers
2 author rows

Possible papers

16

KER Journal 2019 Journal Article

Introspective Q-learning and learning from demonstration

  • Mao Li
  • Tim Brys
  • Daniel Kudenko

One challenge faced by reinforcement learning (RL) agents is that in many environments the reward signal is sparse, leading to slow improvement of the agent’s performance in early learning episodes. Potential-based reward shaping can help to resolve the aforementioned issue of sparse reward by incorporating an expert’s domain knowledge into the learning through a potential function. Past work on reinforcement learning from demonstration (RLfD) directly mapped (sub-optimal) human expert demonstration to a potential function, which can speed up RL. In this paper we propose an introspective RL agent that significantly further speeds up the learning. An introspective RL agent records its state–action decisions and experience during learning in a priority queue. Good quality decisions, according to a Monte Carlo estimation, will be kept in the queue, while poorer decisions will be rejected. The queue is then used as demonstration to speed up RL via reward shaping. A human expert’s demonstration can be used to initialize the priority queue before the learning process starts. Experimental validation in the 4-dimensional CartPole domain and the 27-dimensional Super Mario AI domain shows that our approach significantly outperforms non-introspective RL and state-of-the-art approaches in RLfD in both domains.
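
As a rough illustration of the mechanism this abstract describes, here is a minimal sketch of the priority queue of good decisions; the class and parameter names are placeholders, states are assumed to be simple hashable values, and the exact-match potential is a deliberate simplification of the paper's shaping scheme.

```python
import heapq
from itertools import count


class IntrospectiveBuffer:
    """Minimal sketch of the priority queue of good decisions described
    above; names, capacity and the exact-match potential are assumptions,
    not the authors' implementation."""

    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.heap = []       # min-heap of (return, tiebreak, state, action)
        self._tie = count()  # avoids comparing states when returns tie

    def add_episode(self, trajectory, gamma=0.99):
        """trajectory: list of (state, action, reward); walks the episode
        backwards to compute Monte Carlo returns and keeps only the best
        decisions, evicting the worst stored one when full."""
        G = 0.0
        for state, action, reward in reversed(trajectory):
            G = reward + gamma * G
            item = (G, next(self._tie), state, action)
            if len(self.heap) < self.capacity:
                heapq.heappush(self.heap, item)
            elif G > self.heap[0][0]:
                heapq.heapreplace(self.heap, item)

    def potential(self, state, action):
        """Potential for reward shaping: 1 if (state, action) matches a
        stored good decision, else 0 (a deliberately crude similarity)."""
        return 1.0 if any(s == state and a == action
                          for _, _, s, a in self.heap) else 0.0
```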

AAAI Conference 2018 Conference Paper

Adapting to Concept Drift in Credit Card Transaction Data Streams Using Contextual Bandits and Decision Trees

  • Dennis Soemers
  • Tim Brys
  • Kurt Driessens
  • Mark Winands
  • Ann Nowé

Credit card transactions predicted to be fraudulent by automated detection systems are typically handed over to human experts for verification. To limit costs, it is standard practice to select only the most suspicious transactions for investigation. We claim that a trade-off between exploration and exploitation is imperative to enable adaptation to changes in behavior (concept drift). Exploration consists of the selection and investigation of transactions with the purpose of improving predictive models, and exploitation consists of investigating transactions detected to be suspicious. Modeling the detection of fraudulent transactions as rewarding, we use an incremental Regression Tree learner to create clusters of transactions with similar expected rewards. This enables the use of a Contextual Multi-Armed Bandit (CMAB) algorithm to provide the exploration/exploitation trade-off. We introduce a novel variant of a CMAB algorithm that makes use of the structure of this tree, and use Semi-Supervised Learning to grow the tree using unlabeled data. The approach is evaluated on a real dataset and data generated by a simulator that adds concept drift by adapting the behavior of fraudsters to avoid detection. It outperforms frequently used offline models in terms of cumulative rewards, in particular in the presence of concept drift.
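
The exploration/exploitation step can be sketched as follows, assuming each regression-tree leaf acts as a bandit arm; the `tree.leaf_id` helper and the plain UCB1 bonus are illustrative assumptions, not the paper's tree-structured variant.

```python
import math


def select_transaction(transactions, tree, counts, rewards, t, c=2.0):
    """Hedged sketch of picking which incoming transaction to investigate.

    transactions : list of candidate transactions (feature vectors)
    tree         : regression tree; `tree.leaf_id(x)` is an assumed helper
                   mapping a transaction to its leaf (the bandit arm)
    counts       : dict leaf -> number of past investigations
    rewards      : dict leaf -> cumulative reward (frauds confirmed)
    t            : current time step
    """
    def ucb(leaf):
        n = counts.get(leaf, 0)
        if n == 0:
            return float("inf")          # investigate unseen leaves first
        mean = rewards.get(leaf, 0.0) / n
        return mean + c * math.sqrt(math.log(max(t, 2)) / n)

    # Score every candidate transaction by the UCB value of its leaf/arm.
    return max(transactions, key=lambda x: ucb(tree.leaf_id(x)))
```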

EWRL Workshop 2018 Workshop Paper

Directed Policy Gradient for Safe Reinforcement Learning with Human Advice

  • Hélène Plisnier
  • Denis Steckelmacher
  • Tim Brys
  • Diederik Roijers
  • Ann Nowé

Many currently deployed Reinforcement Learning agents work in an environment shared with humans, be they co-workers, users or clients. It is desirable that these agents adjust to people’s preferences, learn faster thanks to their help, and act safely around them. We argue that most current approaches that learn from human feedback are unsafe: rewarding or punishing the agent a posteriori cannot immediately prevent it from wrongdoing. In this paper, we extend Policy Gradient to make it robust to external directives that would otherwise break the fundamentally on-policy nature of Policy Gradient. Our technique, Directed Policy Gradient (DPG), allows a teacher or backup policy to override the agent before it acts undesirably, while allowing the agent to leverage human advice or directives to learn faster. Our experiments demonstrate that DPG makes the agent learn much faster than reward-based approaches, while requiring an order of magnitude less advice. Keywords: Policy Shaping, Human Advice, Policy Gradient
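
A minimal sketch of the core idea, letting an external directive steer the agent before it acts: the learned policy and an advice distribution are combined and the action is sampled from the result. The multiplicative combination used here is an assumption for illustration, not necessarily DPG's exact rule.

```python
import numpy as np


def advised_action(policy_probs, advice_probs, rng=None):
    """Sample an action from the agent's policy reshaped by advice.

    policy_probs : action probabilities pi(a|s) from the learned policy
    advice_probs : teacher/backup directive over actions
                   (uniform when the teacher is silent)
    """
    rng = rng or np.random.default_rng()
    mixed = np.asarray(policy_probs) * np.asarray(advice_probs)
    mixed = mixed / mixed.sum()          # renormalise to a distribution
    return int(rng.choice(len(mixed), p=mixed))
```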

AAMAS Conference 2018 Conference Paper

Introspective Reinforcement Learning and Learning from Demonstration

  • Mao Li
  • Tim Brys
  • Daniel Kudenko

Reinforcement learning is a paradigm used to model how an autonomous agent learns to maximize its cumulative reward by interacting with the environment. One challenge faced by reinforcement learning is that in many environments the reward signal is sparse, leading to slow improvement of the agent’s performance in early learning episodes. Potential-based reward shaping is a technique that can resolve the aforementioned issue of sparse reward by incorporating an expert’s domain knowledge in the learning via a potential function. Past work on reinforcement learning from demonstration directly mapped (sub-optimal) human expert demonstrations to a potential function, which can speed up reinforcement learning. In this paper we propose an introspective reinforcement learning agent that significantly speeds up the learning further. An introspective reinforcement learning agent records its state–action decisions and experiences during learning in a priority queue. Good quality decisions will be kept in the queue, while poorer decisions will be rejected. The queue is then used as demonstration to speed up reinforcement learning via reward shaping. An expert agent’s demonstrations can be used to initialise the priority queue before the learning process starts. Experimental validations in the 4-dimensional CartPole domain and the 27-dimensional Super Mario AI domain show that our approach significantly outperforms state-of-the-art approaches to reinforcement learning from demonstration in both domains.

AAMAS Conference 2016 Conference Paper

Learning from Demonstration for Shaping through Inverse Reinforcement Learning

  • Halit Bener Suay
  • Tim Brys
  • Matthew E. Taylor
  • Sonia Chernova

Model-free episodic reinforcement learning problems define the environment reward with functions that often provide only sparse information throughout the task. Consequently, agents are not given enough feedback about the fitness of their actions until the task ends with success or failure. Previous work addresses this problem with reward shaping. In this paper we introduce a novel approach to improve model-free reinforcement learning agents’ performance with a three-step approach. Specifically, we collect demonstration data, use the data to recover a linear function via inverse reinforcement learning, and use the recovered function for potential-based reward shaping. Our approach is model-free and scalable to high-dimensional domains. To show the scalability of our approach we present two sets of experiments in a two-dimensional Maze domain and the 27-dimensional Mario AI domain. We compare the performance of our algorithm to previously introduced reinforcement learning from demonstration algorithms. Our experiments show that our approach outperforms the state-of-the-art in cumulative reward, learning rate and asymptotic performance.
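
Once inverse reinforcement learning has produced linear weights over state features, the third step reduces to standard potential-based shaping, roughly as in this sketch; the weights `w` and feature map `phi` are assumed inputs and the IRL step itself is omitted.

```python
import numpy as np


def shaping_reward(w, phi, s, s_next, gamma=0.99):
    """Potential-based shaping term added to the environment reward,
    using the IRL-recovered linear function as the potential:
    F(s, s') = gamma * Phi(s') - Phi(s), with Phi(s) = w . phi(s)."""
    potential = lambda state: float(np.dot(w, phi(state)))
    return gamma * potential(s_next) - potential(s)
```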

IJCAI Conference 2015 Conference Paper

Encoding and Combining Knowledge to Speed up Reinforcement Learning

  • Tim Brys

Reinforcement learning algorithms typically require too many ‘trial-and-error’ experiences before reaching a desirable behaviour. A considerable amount of ongoing research is focused on speeding up this learning process by using external knowledge. We contribute in several ways, proposing novel approaches to transfer learning and learning from demonstration, as well as an ensemble approach to combine knowledge from various sources.

IJCAI Conference 2015 Conference Paper

Reinforcement Learning from Demonstration through Shaping

  • Tim Brys
  • Anna Harutyunyan
  • Halit Bener Suay
  • Sonia Chernova
  • Matthew E. Taylor
  • Ann Nowé

Reinforcement learning describes how a learning agent can achieve optimal behaviour based on interactions with its environment and reward feedback. A limiting factor in reinforcement learning as employed in artificial intelligence is the need for an often prohibitively large number of environment samples before the agent reaches a desirable level of performance. Learning from demonstration is an approach that provides the agent with demonstrations by a supposed expert, from which it should derive suitable behaviour. Yet, one of the challenges of learning from demonstration is that no guarantees can be provided for the quality of the demonstrations, and thus the learned behavior. In this paper, we investigate the intersection of these two approaches, leveraging the theoretical guarantees provided by reinforcement learning, and using expert demonstrations to speed up this learning by biasing exploration through a process called reward shaping. This approach allows us to leverage human input without making an erroneous assumption regarding demonstration optimality. We show experimentally that this approach requires significantly fewer demonstrations, is more robust against suboptimality of demonstrations, and achieves much faster learning than the recently developed HAT algorithm.
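
One hedged sketch of how demonstrations can be turned into a shaping potential, using a Gaussian similarity to the closest demonstrated state in which the expert took the same action; the single bandwidth `sigma` and the nearest-demonstration rule are illustrative assumptions rather than the paper's exact construction.

```python
import numpy as np


def demo_potential(state, action, demos, sigma=0.5):
    """Potential of (state, action): similarity to the closest demonstrated
    state in which the expert chose the same action.

    demos : list of (state, action) pairs, states as np.ndarray
    """
    best = 0.0
    for s_d, a_d in demos:
        if a_d != action:
            continue                      # only same-action demonstrations count
        d2 = float(np.sum((np.asarray(state) - np.asarray(s_d)) ** 2))
        best = max(best, float(np.exp(-d2 / (2.0 * sigma ** 2))))
    return best
```

This potential can then be plugged into the usual state–action shaping term, roughly gamma * Phi(s', a') - Phi(s, a), added on top of the environment reward.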

RLDM Conference 2015 Conference Abstract

Reward Shaping by Demonstration

  • Halit Suay
  • Sonia Chernova
  • Tim Brys
  • Matthew Taylor

Potential-based reward shaping is a theoretically sound way of incorporating prior knowledge in a reinforcement learning setting. While providing flexibility for choosing the potential function, this method guarantees the convergence of the final policy, regardless of the properties of the potential function. However, this flexibility of choice may cause confusion when making a design decision for a specific domain, as the number of possible candidates for a potential function can be overwhelming. Moreover, the potential function either can be manually designed, to bias the behavior of the learner, or can be recovered from prior knowledge, e.g. from human demonstrations. In this paper we investigate the efficacy of two different ways of using a potential function recovered from human demonstrations. The first approach uses a mixture of Gaussian distributions generated by samples collected during demonstrations (Gaussian-Shaping), and the second approach uses a reward function recovered from demonstrations with Relative Entropy Inverse Reinforcement Learning (RE-IRL-Shaping). We present our findings in Cart-Pole, Mountain Car, and Puddle World domains. Our results show that Gaussian-Shaping can provide an efficient reward heuristic, accelerating learning through its ability to capture local information, and RE-IRL-Shaping can be more resilient to bad demonstrations. We report a brief analysis of our findings and aim to provide a future reference for reinforcement learning agent designers who consider using reward shaping by human demonstrations.
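
A possible reading of the Gaussian-Shaping idea as code: fit a mixture of Gaussians to the demonstrated states and reuse its normalised density as the potential. The use of scikit-learn and the component count are assumptions; the paper's construction may differ in detail.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def gaussian_shaping_potential(demo_states, n_components=3):
    """Fit a Gaussian mixture to demonstrated states and return a potential
    function whose value is the mixture density, scaled into (0, 1]."""
    gmm = GaussianMixture(n_components=n_components).fit(np.asarray(demo_states))
    log_peak = gmm.score_samples(gmm.means_).max()   # rough normaliser

    def potential(state):
        log_p = gmm.score_samples(np.asarray(state).reshape(1, -1))[0]
        return float(np.exp(log_p - log_peak))
    return potential
```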

EWRL Workshop 2015 Workshop Paper

Using PCA to Efficiently Represent State Spaces

  • William Curran
  • Tim Brys
  • Matthew Taylor
  • William Smart

Reinforcement learning algorithms need to deal with the exponential growth of states and actions when exploring optimal control in high-dimensional spaces. This is known as the curse of dimensionality. By projecting the agent’s state onto a low-dimensional manifold, we can represent the state space in a smaller and more efficient representation. By using this representation during learning, the agent can converge to a good policy much faster. We test this approach in the Mario Benchmarking Domain. When using dimensionality reduction in Mario, learning converges much faster to a good policy. However, there is a critical convergence–performance trade-off: by projecting onto a low-dimensional manifold, we are ignoring important data. In this paper, we explore this trade-off between convergence and performance. We find that by learning in as few as 4 dimensions (instead of 9), we can improve performance beyond that of learning in the full-dimensional space, at a faster convergence rate.
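
The dimensionality-reduction step can be sketched as a PCA encoder fitted on a batch of observed states; the use of scikit-learn is an assumption, and the 4-component default simply mirrors the abstract's example.

```python
import numpy as np
from sklearn.decomposition import PCA


def make_pca_encoder(sample_states, n_components=4):
    """Fit PCA on a batch of observed states and return an encoder that
    projects each new state onto the low-dimensional manifold before it is
    fed to the learner."""
    pca = PCA(n_components=n_components).fit(np.asarray(sample_states))

    def encode(state):
        return pca.transform(np.asarray(state).reshape(1, -1))[0]
    return encode
```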

AAAI Conference 2014 Conference Paper

Combining Multiple Correlated Reward and Shaping Signals by Measuring Confidence

  • Tim Brys
  • Ann Nowé
  • Daniel Kudenko
  • Matthew Taylor

Multi-objective problems with correlated objectives are a class of problems that deserve specific attention. In contrast to typical multi-objective problems, they do not require the identification of trade-offs between the objectives, as (near-) optimal solutions for any objective are (near-) optimal for every objective. Intelligently combining the feedback from these objectives, instead of only looking at a single one, can improve optimization. This class of problems is very relevant in reinforcement learning, as any single-objective reinforcement learning problem can be framed as such a multi-objective problem using multiple reward shaping functions. After discussing this problem class, we propose a solution technique for such reinforcement learning problems, called adaptive objective selection. This technique makes a temporal difference learner estimate the Q-function for each objective in parallel, and introduces a way of measuring confidence in these estimates. This confidence metric is then used to choose which objective’s estimates to use for action selection. We show significant improvements in performance over other plausible techniques on two problem domains. Finally, we provide an intuitive analysis of the technique’s decisions, yielding insights into the nature of the problems being solved.
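
A rough sketch of adaptive objective selection with a deliberately crude confidence proxy (the gap between the best and second-best Q-value); the paper's confidence measure is more principled, and the tabular `q_tables` interface is an assumption.

```python
import numpy as np


def adaptive_objective_selection(q_tables, state):
    """Each objective keeps its own Q estimates; the most confident
    objective in the current state chooses the action.

    q_tables : list of dict-like {state: np.ndarray of action values}
    """
    def confidence(q_values):
        top2 = np.sort(q_values)[-2:]
        return top2[1] - top2[0]          # larger gap = clearer preference

    per_objective = [np.asarray(q[state]) for q in q_tables]
    best = max(per_objective, key=confidence)
    return int(np.argmax(best))           # act greedily on the chosen objective
```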

ECAI Conference 2014 Conference Paper

Using Ensemble Techniques and Multi-Objectivization to Solve Reinforcement Learning Problems

  • Tim Brys
  • Matthew E. Taylor
  • Ann Nowé

Recent work on multi-objectivization has shown how a single-objective reinforcement learning problem can be turned into a multi-objective problem with correlated objectives, by providing multiple reward shaping functions. The information contained in these correlated objectives can be exploited to solve the base, single-objective problem faster and better, given techniques specifically aimed at handling such correlated objectives. In this paper, we identify ensemble techniques as a set of methods that is suitable to solve multi-objectivized reinforcement learning problems. We empirically demonstrate their use on the Pursuit domain.
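
For contrast with the confidence-based selection above, one of the simplest ensemble rules in this multi-objectivized setting is a majority vote over the objectives' greedy actions, sketched below; this is only one of the ensemble techniques the paper considers, and ties are broken arbitrarily.

```python
from collections import Counter

import numpy as np


def ensemble_vote(q_tables, state):
    """Each objective's learner votes for its greedy action; the action
    with the most votes is executed."""
    votes = Counter(int(np.argmax(np.asarray(q[state]))) for q in q_tables)
    return votes.most_common(1)[0][0]
```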

EUMAS Conference 2011 Conference Paper

Local Coordination in Online Distributed Constraint Optimization Problems

  • Tim Brys
  • Yann-Michaël De Hauwere
  • Ann Nowé
  • Peter Vrancx

In cooperative multi-agent systems, group performance often depends more on the interactions between team members than on the performance of any individual agent. Hence, coordination among agents is essential to optimize the group strategy. One solution which is common in the literature is to let the agents learn in a joint action space. Joint Action Learning (JAL) enables agents to explicitly take into account the actions of other agents, but has the significant drawback that the action space in which the agents must learn scales exponentially in the number of agents. Local coordination is a way for a team to coordinate while keeping communication and computational complexity low. It allows the exploitation of a specific dependency structure underlying the problem, such as tight couplings between specific agents. In this paper we investigate a novel approach to local coordination, in which agents learn this dependency structure, resulting in coordination which is beneficial to the group performance. We evaluate our approach in the context of online distributed constraint optimization problems.
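
A small illustration of the scaling argument made in this abstract: the joint action space considered by Joint Action Learning grows exponentially in the number of agents, which is exactly what local coordination tries to avoid by forming joint actions only with the few agents an agent has learned it depends on.

```python
from itertools import product


def joint_action_space(n_agents, n_actions):
    """Enumerate every combination of the agents' actions, as Joint Action
    Learning must; its size is n_actions ** n_agents."""
    return list(product(range(n_actions), repeat=n_agents))


# e.g. 5 agents with 4 actions each already yield 4**5 = 1024 joint actions.
print(len(joint_action_space(5, 4)))      # 1024
```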