Arrow Research search

Author name cluster

Ann Nowé

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

57 papers
2 author rows

Possible papers

57

AAMAS Conference 2025 Conference Paper

A JAX-Accelerated Simulation Framework for Multi-Agent Energy Management in Energy Communities

  • Hicham Azmani
  • Andries Rosseau
  • Marjon Blondeel
  • Ann Nowé

The global push toward renewable energy has accelerated the formation of energy communities, where households produce and share electricity locally to reduce grid strain and promote sustainability. Regulators and researchers are still exploring optimal energy exchange mechanisms. However, communities often decide how they will exchange energy themselves, even though many lack the information and tools needed to make informed decisions. To address these challenges, we introduce a JAX-accelerated simulation-based framework that allows researchers to prototype and evaluate diverse energy exchange models under realistic conditions. On top of this framework, we present an interactive demonstration targeted at legislators, citizens, and other non-technical stakeholders, offering an intuitive introduction to foundational concepts and a hands-on environment for experimenting with different community setups and exchange mechanisms. The project website is available here. Beyond technology, our work is grounded in an interdisciplinary project integrating legal analysis, social science research, and public engagement via citizen jury sessions. By bridging these domains, we aim to empower communities and decision-makers to make more informed, equitable choices in transitioning to sustainable energy systems.
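
As a flavour of what a JAX-accelerated community simulator involves, here is a minimal sketch of one vectorised time step. The household battery model, the variable names, and the naive "pool all surpluses" exchange rule are illustrative assumptions, not the framework's actual API.

```python
# Minimal sketch (not the paper's API): one vectorised time step for a
# community of households. The battery model and the pooling rule are
# illustrative assumptions.
import jax
import jax.numpy as jnp

def household_step(soc, production, demand, capacity):
    """Charge/discharge one household battery; return new state and net flow."""
    net = production - demand            # surplus (+) or shortfall (-) in kWh
    new_soc = jnp.clip(soc + net, 0.0, capacity)
    residual = net - (new_soc - soc)     # what the battery could not absorb/cover
    return new_soc, residual

# vmap vectorises the per-household update over the whole community.
community_step = jax.vmap(household_step)

def exchange(residuals):
    """Toy community pooling: net out surpluses and shortfalls; rest hits the grid."""
    return jnp.sum(residuals)            # >0: exported to grid, <0: imported

soc = jnp.zeros(4)
capacity = jnp.full(4, 10.0)
production = jnp.array([3.0, 0.5, 2.0, 0.0])
demand = jnp.array([1.0, 2.0, 1.5, 2.5])
soc, residuals = community_step(soc, production, demand, capacity)
print(soc, exchange(residuals))
```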

JAIR Journal 2025 Journal Article

Collective Intelligence in Decision-Making with Non-Stationary Experts

  • Axel Abels
  • Vito Trianni
  • Ann Nowé
  • Tom Lenaerts

When sufficient experience to make informed decisions is unavailable, expert advice can help us navigate uncertainty. As expertise evolves, driven by continuous learning in human experts or model updates in artificial experts, it is crucial to adopt adaptive approaches. Existing methods for exploiting non-stationary experts focus on competing with the single best expert. In contrast, this work harnesses the power of collective intelligence to facilitate better decision-making in the face of evolving expertise or dynamic environments. To achieve this, we propose the novel CORVAL approach, which optimally combines the insights of multiple experts. By adapting to drifts in expertise, our novel approach can surpass the performance of the single best expert as well as previous approaches. Empirical evaluations on a diverse range of non-stationary problems, including active learning applications, showcase the improved performance of our approach in collective decision-making scenarios.
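
The abstract does not spell out CORVAL's update rule, so the sketch below shows only the generic ingredient it builds on, aggregating expert advice with weights that can track drifting expertise, using the standard fixed-share variant of exponential weights rather than CORVAL itself; all parameter values are assumptions.

```python
# Hedged sketch (fixed-share exponential weights, not the CORVAL algorithm):
# combine several experts' advice with weights that can track drift.
import numpy as np

def combine(advice, weights):
    """Weighted aggregate of expert advice (rows: experts, cols: options)."""
    w = weights / weights.sum()
    return w @ advice

def update_weights(weights, losses, eta=0.5, alpha=0.01):
    # Exponential weights plus a fixed-share mix towards uniform, a standard
    # trick for tracking the best expert under non-stationarity.
    w = weights * np.exp(-eta * losses)
    w /= w.sum()
    return (1 - alpha) * w + alpha / len(w)

advice = np.array([[0.9, 0.1],    # expert 0's probabilities over two options
                   [0.2, 0.8]])   # expert 1's probabilities
weights = np.ones(2)
print(combine(advice, weights))
weights = update_weights(weights, losses=np.array([0.8, 0.1]))
```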

AAMAS Conference 2025 Conference Paper

Composing Reinforcement Learning Policies, with Formal Guarantees

  • Florent Delgrange
  • Guy Avni
  • Anna Lukina
  • Christian Schilling
  • Ann Nowé
  • Guillermo A. Pérez

We propose a novel framework for controller design in environments with a two-level structure: a known high-level graph (“map”) in which each vertex is populated by a Markov decision process, called a “room”. The framework “separates concerns” by using different design techniques for low- and high-level tasks. We apply reactive synthesis for high-level tasks: given a specification as a logical formula over the high-level graph and a collection of low-level policies obtained together with “concise” latent structures, we construct a “planner” that selects which low-level policy to apply in each room. We develop a reinforcement learning procedure to train low-level policies on latent structures, which unlike previous approaches, circumvents a model distillation step. We pair the policy with probably approximately correct guarantees on its performance and on the abstraction quality, and lift these guarantees to the high-level task. These formal guarantees are the main advantage of the framework. Other advantages include scalability (rooms are large and their dynamics are unknown) and reusability of low-level policies. We demonstrate feasibility in challenging case studies where an agent navigates environments with moving obstacles and visual inputs.

AAMAS Conference 2025 Conference Paper

Curiosity-Driven Partner Selection Accelerates Convention Emergence in Language Games

  • Chin-wing Leung
  • Paolo Turrini
  • Ann Nowé

In language games a speaker and a listener attempt to coordinate on a shared mapping between words and concepts. The usual approach in the literature is to study convention emergence in well-mixed populations, where pairs of agents are randomly matched to play the role of speaker and listener, respectively. This way of pairing agents can be shown to promote the emergence of a unifying common language in the long run. Despite the theoretical guarantee, convention emergence can be very slow and practically unfeasible, especially in large populations with many words and concepts. Here, we propose an alternative approach, where we allow agents to selectively partner with other agents based on their past experience. To this aim, we study Boltzmann Q-learning agents that are curiosity-driven, i.e., more likely to choose partners they misunderstood in the past. We show that this selection method significantly accelerates convention emergence when compared against a random-matching baseline and is even more pronounced in graph generation models restricting agents’ communication channels. By inspecting the evolution of the agents’ interaction frequency we see that partner selection induces low treewidth and high degree variance at the early stages of learning, to then converge to a regular graph, which allows for settling misunderstandings in the population at a faster rate than the traditional approaches.
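
A minimal sketch of the curiosity-driven pairing mechanism described above, assuming a simple running misunderstanding score per partner; the score update and the temperature are illustrative, not the paper's exact formulation.

```python
# Sketch (assumption-level, not the paper's code) of curiosity-driven partner
# selection: each agent keeps a per-partner score that grows with past
# misunderstandings and samples partners through a Boltzmann distribution.
import numpy as np

rng = np.random.default_rng(0)
n_agents = 5
misunderstanding = np.zeros((n_agents, n_agents))  # running scores
temperature = 0.5

def pick_partner(agent):
    scores = misunderstanding[agent].copy()
    scores[agent] = -np.inf                    # never pick yourself
    probs = np.exp(scores / temperature)
    probs /= probs.sum()
    return rng.choice(n_agents, p=probs)

def record_outcome(speaker, listener, understood, lr=0.1):
    # Curiosity: failed games raise the score, successes decay it.
    target = 0.0 if understood else 1.0
    misunderstanding[speaker, listener] += lr * (target - misunderstanding[speaker, listener])

partner = pick_partner(0)
record_outcome(0, partner, understood=False)
```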

AAMAS Conference 2025 Conference Paper

Divide and Conquer: Provably Unveiling the Pareto Front with Multi-Objective Reinforcement Learning

  • Willem Röpke
  • Mathieu Reymond
  • Patrick Mannion
  • Diederik M. Roijers
  • Ann Nowé
  • Roxana Rădulescu

An important challenge in multi-objective reinforcement learning is obtaining a Pareto front of policies to attain optimal performance under different preferences. We introduce Iterated Pareto Referent Optimisation (IPRO), which decomposes finding the Pareto front into a sequence of constrained single-objective problems. This enables us to guarantee convergence while providing an upper bound on the distance to undiscovered Pareto optimal solutions at each step. We evaluate IPRO using utility-based metrics and its hypervolume and find that it matches or outperforms methods that require additional assumptions. By leveraging problem-specific single-objective solvers, our approach also holds promise for applications beyond multi-objective reinforcement learning, such as planning and pathfinding.

AAMAS Conference 2024 Conference Paper

A Reinforcement Learning Framework for Studying Group and Individual Fairness

  • Alexandra Cimpean
  • Catholijn Jonker
  • Pieter Libin
  • Ann Nowé

Reinforcement learning is a commonly used technique for optimising objectives in decision support systems for complex problem solving. When these systems affect individuals or groups, it is essential to reflect on fairness. As absolute fairness is in practice not achievable, we propose a framework which allows distinct fairness notions to be balanced along with the primary objective. To this end, we formulate group and individual fairness as sequential fairness notions. First, we present an extended Markov decision process, 𝑓MDP, that is explicitly aware of individuals and groups. Next, we formalise fairness notions in terms of this 𝑓MDP, which allows us to evaluate the primary objective along with the fairness notions that are important to the user, taking a multi-objective reinforcement learning approach. To evaluate our framework, we consider two scenarios that require distinct aspects of the performance-fairness trade-off: job hiring and fraud detection. The objectives in job hiring are to compose strong teams, while providing equal treatment to similar individual applicants and to groups in society. The trade-off in fraud detection is the necessity of detecting fraudulent transactions, while fairly distributing the burden of checking transactions over customers. In this framework, we further explore the influence of distance metrics on individual fairness and highlight the impact of the history size on the fairness calculations and the obtainable fairness through exploration.

AAMAS Conference 2024 Conference Paper

Interactively Learning the User's Utility for Best-Arm Identification in Multi-Objective Multi-Armed Bandits

  • Mathieu Reymond
  • Eugenio Bargiacchi
  • Diederik M. Roijers
  • Ann Nowé

Many real-world problems have multiple, conflicting objectives. Without knowing the utility function of the decision maker, one must extensively learn all Pareto-efficient trade-offs to make sure that the true preferred policy is included in the learned set. Because such thorough exploration can be expensive (especially in high-dimensional multi-objective problems), a possible alternative is to allow some form of interaction with the decision maker so as to gain some information about the utility function. In particular, in this work we assume that limited queries can be made to the decision maker to gather some information about the true utility function, concurrently with the search process being carried out. Improving our knowledge of the utility function narrows the search space of the optimal policy. In turn, this results in more relevant trade-offs being used to query the decision maker. Thus, correctly timing the queries is crucial to maximize information gain. We refer to this setting as fixed-budget best-arm identification for multi-objective multi-armed bandits, which adds to the traditional arm-pull actions a separate query action that can be taken instead, where both actions have fixed but separate budgets. We propose Monte-Carlo Bayesian Utility Learning (MCBUL), a method based on Monte-Carlo planning that is able to optimize the timing of query actions w.r.t. the arm-pull actions. We show that MCBUL significantly improves the chances of finding the optimal policy compared to baselines that interact with the decision maker at fixed intervals.

ICLR Conference 2024 Conference Paper

The Wasserstein Believer: Learning Belief Updates for Partially Observable Environments through Reliable Latent Space Models

  • Raphaël Avalos
  • Florent Delgrange
  • Ann Nowé
  • Guillermo A. Pérez
  • Diederik M. Roijers

Partially Observable Markov Decision Processes (POMDPs) are used to model environments where the state cannot be perceived, necessitating reasoning based on past observations and actions. However, remembering the full history is generally intractable due to the exponential growth in the history space. Maintaining a probability distribution that models the belief over the current state can be used as a sufficient statistic of the history, but its computation requires access to the model of the environment and is often intractable. While SOTA algorithms use Recurrent Neural Networks to compress the observation-action history aiming to learn a sufficient statistic, they lack guarantees of success and can lead to sub-optimal policies. To overcome this, we propose the Wasserstein Belief Updater, an RL algorithm that learns a latent model of the POMDP and an approximation of the belief update under the assumption that the state is observable during training. Our approach comes with theoretical guarantees on the quality of our approximation ensuring that our latent beliefs allow for learning the optimal value function.

AAMAS Conference 2024 Conference Paper

Trustworthy Reinforcement Learning: Opportunities and Challenges

  • Ann Nowé

Reinforcement Learning (RL) has long outgrown the traditional representations that guaranteed policy convergence but severely limited its application to complex domains. Modern Deep RL enables far richer and more complex behaviour, yet at the cost of transparency and explainability. While these latter issues have recently received much attention in Machine Learning, they are underexplored in RL. In this talk, I will discuss them from multiple angles, survey state-of-the-art approaches, including recent developments in policy distillation and formal guarantees, and touch upon the related question of fairness.

AAMAS Conference 2023 Conference Paper

A Brief Guide to Multi-Objective Reinforcement Learning and Planning

  • Conor F. Hayes
  • Roxana Rădulescu
  • Eugenio Bargiacchi
  • Johan Källström
  • Matthew Macfarlane
  • Mathieu Reymond
  • Timothy Verstraeten
  • Luisa M. Zintgraf

Real-world sequential decision-making tasks are usually complex, and require trade-offs between multiple – often conflicting – objectives. However, the majority of research in reinforcement learning (RL) and decision-theoretic planning assumes a single objective, or that multiple objectives can be handled via a predefined weighted sum over the objectives. Such approaches may oversimplify the underlying problem, and produce suboptimal results. This extended abstract outlines the limitations of using a semi-blind iterative process to solve multi-objective decision making problems. Our extended paper [4] serves as a guide for the application of explicitly multi-objective methods to difficult problems.

AAMAS Conference 2023 Conference Paper

A Study of Nash Equilibria in Multi-Objective Normal-Form Games

  • Willem Röpke
  • Diederik M. Roijers
  • Ann Nowé
  • Roxana Rădulescu

We present a detailed analysis of Nash equilibria in multi-objective normal-form games, which are normal-form games with vectorial payoffs. Our approach is based on modelling each player’s utility using a utility function that maps a vector to a scalar utility. For mixed strategies, we can apply the utility function before calculating the expectation of the payoff vector as well as after, resulting in two distinct optimisation criteria. We show that when computing the utility from the expected payoff, a Nash equilibrium can be guaranteed when players have quasiconcave utility functions. In addition, we show that when players have quasiconvex utility functions, pure strategy Nash equilibria are equal under both optimisation criteria. We extend this to settings where some players optimise for one criterion, while others optimise for the second. We combine these results and formulate an algorithm that computes all pure strategy Nash equilibria given quasiconvex utility functions.
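
The distinction between applying the utility before or after the expectation can be made concrete in a few lines; the payoff matrix and the nonlinear utility below are toy assumptions.

```python
# Hedged sketch of the two optimisation criteria discussed above, for one
# player in a multi-objective normal-form game. SER applies the utility to
# the *expected* payoff vector; ESR takes the *expected utility* over joint
# pure-strategy outcomes. They differ for nonlinear utility functions.
import numpy as np

payoffs = np.array([[[4.0, 0.0], [2.0, 2.0]],   # payoff vectors, shape
                    [[1.0, 3.0], [0.0, 4.0]]])  # (rows, cols, objectives)
u = lambda v: v[..., 0] * v[..., 1]             # toy nonlinear utility (product)

row = np.array([0.5, 0.5])                       # mixed strategies
col = np.array([0.5, 0.5])
joint = np.outer(row, col)                       # outcome probabilities

expected_payoff = np.tensordot(joint, payoffs, axes=([0, 1], [0, 1]))
ser = u(expected_payoff)                         # utility of the expectation
esr = np.sum(joint * u(payoffs))                 # expectation of the utility
print(ser, esr)                                  # 3.9375 vs 1.75 here
```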

JAAMAS Journal 2023 Journal Article

Actor-critic multi-objective reinforcement learning for non-linear utility functions

  • Mathieu Reymond
  • Conor F. Hayes
  • Ann Nowé

We propose a novel multi-objective reinforcement learning algorithm that successfully learns the optimal policy even for non-linear utility functions. Non-linear utility functions pose a challenge for SOTA approaches, both in terms of learning efficiency as well as the solution concept. A key insight is that, by proposing a critic that learns a multi-variate distribution over the returns, which is then combined with accumulated rewards, we can directly optimize on the utility function, even if it is non-linear. This allows us to vastly increase the range of problems that can be solved compared to those which can be handled by single-objective methods or multi-objective methods requiring linear utility functions, yet avoiding the need to learn the full Pareto front. We demonstrate our method on multiple multi-objective benchmarks, and show that it learns effectively where baseline approaches fail.

AAMAS Conference 2023 Conference Paper

Bridging the Gap Between Single and Multi Objective Games

  • Willem Röpke
  • Carla Groenland
  • Roxana Rădulescu
  • Ann Nowé
  • Diederik M. Roijers

A classic model to study strategic decision making in multi-agent systems is the normal-form game. This model can be generalised to allow for an infinite number of pure strategies leading to continuous games. Multi-objective normal-form games are another generalisation that model settings where players receive separate payoffs in more than one objective. We bridge the gap between the two models by providing a theoretical guarantee that a game from one setting can always be transformed to a game in the other. We extend the theoretical results to include guaranteed equivalence of Nash equilibria. The mapping makes it possible to apply algorithms from one field to the other. We demonstrate this by introducing a fictitious play algorithm for multi-objective games and subsequently applying it to two well-known continuous games. We believe the equivalence relation will lend itself to new insights by translating the theoretical guarantees from one formalism to another. Moreover, it may lead to new computational approaches for continuous games when a problem is more naturally solved in the succinct format of multi-objective games.

IJCAI Conference 2023 Conference Paper

Distributional Multi-Objective Decision Making

  • Willem Röpke
  • Conor F. Hayes
  • Patrick Mannion
  • Enda Howley
  • Ann Nowé
  • Diederik M. Roijers

For effective decision support in scenarios with conflicting objectives, sets of potentially optimal solutions can be presented to the decision maker. We explore both what policies these sets should contain and how such sets can be computed efficiently. With this in mind, we take a distributional approach and introduce a novel dominance criterion relating return distributions of policies directly. Based on this criterion, we present the distributional undominated set and show that it contains optimal policies otherwise ignored by the Pareto front. In addition, we propose the convex distributional undominated set and prove that it comprises all policies that maximise expected utility for multivariate risk-averse decision makers. We propose a novel algorithm to learn the distributional undominated set and further contribute pruning operators to reduce the set to the convex distributional undominated set. Through experiments, we demonstrate the feasibility and effectiveness of these methods, making this a valuable new approach for decision support in real-world problems.

ICML Conference 2023 Conference Paper

Expertise Trees Resolve Knowledge Limitations in Collective Decision-Making

  • Axel Abels
  • Tom Lenaerts
  • Vito Trianni
  • Ann Nowé

Experts advising decision-makers are likely to display expertise which varies as a function of the problem instance. In practice, this may lead to sub-optimal or discriminatory decisions against minority cases. In this work, we model such changes in depth and breadth of knowledge as a partitioning of the problem space into regions of differing expertise. We provide here new algorithms that explicitly consider and adapt to the relationship between problem instances and experts’ knowledge. We first propose and highlight the drawbacks of a naive approach based on nearest neighbor queries. To address these drawbacks we then introduce a novel algorithm — expertise trees — that constructs decision trees enabling the learner to select appropriate models. We provide theoretical insights and empirically validate the improved performance of our novel approach on a range of problems for which existing methods proved to be inadequate.

AAMAS Conference 2023 Conference Paper

Sample-Efficient Multi-Objective Learning via Generalized Policy Improvement Prioritization

  • Lucas N. Alegre
  • Ana L. C. Bazzan
  • Diederik M. Roijers
  • Ann Nowé
  • Bruno C. da Silva

Multi-objective reinforcement learning (MORL) algorithms tackle sequential decision problems where agents may have different preferences over (possibly conflicting) reward functions. Such algorithms often learn a set of policies (each optimized for a particular agent preference) that can later be used to solve problems with novel preferences. We introduce a novel algorithm that uses Generalized Policy Improvement (GPI) to define principled, formally-derived prioritization schemes that improve sample-efficient learning. They implement active-learning strategies by which the agent can (i) identify the most promising preferences/objectives to train on at each moment, to more rapidly solve a given MORL problem; and (ii) identify which previous experiences are most relevant when learning a policy for a particular agent preference, via a novel Dyna-style MORL method. We prove our algorithm is guaranteed to always converge to an optimal solution in a finite number of steps, or an 𝜖-optimal solution (for a bounded 𝜖) if the agent is limited and can only identify possibly sub-optimal policies. We also prove that our method monotonically improves the quality of its partial solutions while learning. Finally, we introduce a bound that characterizes the maximum utility loss (with respect to the optimal solution) incurred by the partial solutions computed by our method throughout learning. We empirically show that our method outperforms state-of-the-art MORL algorithms in challenging multi-objective tasks, both with discrete and continuous state and action spaces.

ICLR Conference 2023 Conference Paper

Wasserstein Auto-encoded MDPs: Formal Verification of Efficiently Distilled RL Policies with Many-sided Guarantees

  • Florent Delgrange
  • Ann Nowé
  • Guillermo A. Pérez

Although deep reinforcement learning (DRL) has many success stories, the large-scale deployment of policies learned through these advanced techniques in safety-critical scenarios is hindered by their lack of formal guarantees. Variational Markov Decision Processes (VAE-MDPs) are discrete latent space models that provide a reliable framework for distilling formally verifiable controllers from any RL policy. While the related guarantees address relevant practical aspects such as the satisfaction of performance and safety properties, the VAE approach suffers from several learning flaws (posterior collapse, slow learning speed, poor dynamics estimates), primarily due to the absence of abstraction and representation guarantees to support latent optimization. We introduce the Wasserstein auto-encoded MDP (WAE-MDP), a latent space model that fixes those issues by minimizing a penalized form of the optimal transport between the behaviors of the agent executing the original policy and the distilled policy, for which the formal guarantees apply. Our approach yields bisimulation guarantees while learning the distilled policy, allowing concrete optimization of the abstraction and representation model quality. Our experiments show that, besides distilling policies up to 10 times faster, the latent model quality is indeed better in general. Moreover, we present experiments from a simple time-to-failure verification algorithm on the latent space. The fact that our approach enables such simple verification techniques highlights its applicability.

AAAI Conference 2022 Conference Paper

Distillation of RL Policies with Formal Guarantees via Variational Abstraction of Markov Decision Processes

  • Florent Delgrange
  • Ann Nowé
  • Guillermo A. Pérez

We consider the challenge of policy simplification and verification in the context of policies learned through reinforcement learning (RL) in continuous environments. In well-behaved settings, RL algorithms have convergence guarantees in the limit. While these guarantees are valuable, they are insufficient for safety-critical applications. Furthermore, they are lost when applying advanced techniques such as deep RL. To recover guarantees when applying advanced RL algorithms to more complex environments with (i) reachability, (ii) safety-constrained reachability, or (iii) discounted-reward objectives, we build upon the DeepMDP framework to derive new bisimulation bounds between the unknown environment and a learned discrete latent model of it. Our bisimulation bounds enable the application of formal methods for Markov decision processes. Finally, we show how one can use a policy obtained via state-of-the-art RL to efficiently train a variational autoencoder that yields a discrete latent model with provably approximately correct bisimulation guarantees. Additionally, we obtain a distilled version of the policy for the latent model.

AAMAS Conference 2022 Conference Paper

Local Advantage Networks for Cooperative Multi-Agent Reinforcement Learning

  • Raphaël Avalos
  • Mathieu Reymond
  • Ann Nowé
  • Diederik M. Roijers

Multi-agent reinforcement learning (MARL) enables us to create adaptive agents in challenging environments, even when the agents have limited observation. Modern MARL methods have focused on finding factorized value functions. While successful, the resulting methods have convoluted network structures. We take a radically different approach and build on the structure of independent Q-learners. Our algorithm LAN leverages a dueling architecture to represent decentralized policies as separate individual advantage functions w.r.t. a centralized critic that is cast aside after training. The critic works as a stabilizer that coordinates the learning and formulates DQN targets. This enables LAN to keep the number of parameters of its centralized network independent of the number of agents, without imposing additional constraints like monotonic value functions. When evaluated on SMAC, LAN shows SOTA performance overall and scores more than 80% wins in two super-hard maps where even QPLEX obtains almost no wins. Moreover, when the number of agents becomes large, LAN uses significantly fewer parameters than QPLEX or even QMIX. We thus show that LAN’s structure forms a key improvement that helps MARL methods remain scalable.
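
A rough sketch of the architecture the abstract describes, with assumed layer sizes: per-agent advantage heads plus one centralized critic whose parameter count does not grow with the number of agents.

```python
# Assumed architecture, not the authors' code: LAN-style per-agent advantage
# heads combined during training with a centralized critic V.
import torch
import torch.nn as nn

class LAN(nn.Module):
    def __init__(self, obs_dim, state_dim, n_actions, n_agents, hidden=64):
        super().__init__()
        self.advantages = nn.ModuleList(
            nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                          nn.Linear(hidden, n_actions))
            for _ in range(n_agents))
        self.critic = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                    nn.Linear(hidden, 1))  # cast aside after training

    def forward(self, obs_per_agent, state):
        v = self.critic(state)                             # shared baseline
        # Individual Q-values: per-agent advantage w.r.t. the central critic.
        return [adv(o) + v for adv, o in zip(self.advantages, obs_per_agent)]

# Execution only needs the decentralized heads:
# action_i = argmax_a advantages[i](obs_i)[a].
```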

EWRL Workshop 2022 Workshop Paper

Local Advantage Networks for Multi-Agent Reinforcement Learning in Dec-POMDPs

  • Raphael Avalos
  • Mathieu Reymond
  • Ann Nowé
  • Diederik M Roijers

Many recent successful off-policy multi-agent reinforcement learning (MARL) algorithms for cooperative partially observable environments focus on finding factorized value functions, leading to convoluted network structures. Building on the structure of independent Q-learners, our LAN algorithm takes a radically different approach, leveraging a dueling architecture to learn decentralized best-response policies via individual advantage functions. The learning is stabilized by a centralized critic whose primary objective is to reduce the moving target problem of the individual advantages. The critic, whose network’s size is independent of the number of agents, is cast aside after learning. Evaluation on the StarCraft II multi-agent challenge benchmark shows that LAN reaches state-of-the-art performance and is more scalable with respect to the number of agents, opening up a new promising direction for MARL research.

AAMAS Conference 2022 Conference Paper

Pareto Conditioned Networks

  • Mathieu Reymond
  • Eugenio Bargiacchi
  • Ann Nowé

In multi-objective optimization, learning all the policies that reach Pareto-efficient solutions is an expensive process. The set of optimal policies can grow exponentially with the number of objectives, and recovering all solutions requires an exhaustive exploration of the entire state space. We propose Pareto Conditioned Networks (PCN), a method that uses a single neural network to encompass all nondominated policies. PCN associates every past transition with its episode’s return. It trains the network such that, when conditioned on this same return, it should reenact said transition. In doing so we transform the optimization problem into a classification problem. We recover a concrete policy by conditioning the network on the desired Pareto-efficient solution. Our method is stable as it learns in a supervised fashion, thus avoiding moving target issues. Moreover, by using a single network, PCN scales efficiently with the number of objectives. Finally, it makes minimal assumptions on the shape of the Pareto front, which makes it suitable to a wider range of problems than previous state-of-the-art multi-objective reinforcement learning algorithms.
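
A minimal sketch of the conditioning idea (layer sizes and the horizon input are assumptions): a single network maps an observation plus a desired return to action logits and is trained with a plain classification loss, with no bootstrapping.

```python
# Illustrative sketch of the Pareto Conditioned Networks idea: condition a
# single network on a desired return vector (and horizon) and train it, in a
# supervised fashion, to reenact the action taken on matching transitions.
import torch
import torch.nn as nn

class PCN(nn.Module):
    def __init__(self, obs_dim, n_objectives, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + n_objectives + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions))

    def forward(self, obs, desired_return, desired_horizon):
        x = torch.cat([obs, desired_return, desired_horizon], dim=-1)
        return self.net(x)                       # action logits

model = PCN(obs_dim=4, n_objectives=2, n_actions=3)
loss_fn = nn.CrossEntropyLoss()
obs = torch.randn(8, 4)
ret = torch.randn(8, 2)                          # episode returns (relabelled)
hor = torch.randint(1, 50, (8, 1)).float()
actions = torch.randint(0, 3, (8,))
loss = loss_fn(model(obs, ret, hor), actions)    # classification, no moving target
loss.backward()
```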

KER Journal 2020 Journal Article

A utility-based analysis of equilibria in multi-objective normal-form games

  • Roxana Rădulescu
  • Patrick Mannion
  • Yijie Zhang
  • Diederik M. Roijers
  • Ann Nowé

In multi-objective multi-agent systems (MOMASs), agents explicitly consider the possible trade-offs between conflicting objective functions. We argue that compromises between competing objectives in MOMAS should be analyzed on the basis of the utility that these compromises have for the users of a system, where an agent’s utility function maps their payoff vectors to scalar utility values. This utility-based approach naturally leads to two different optimization criteria for agents in a MOMAS: expected scalarized returns (ESRs) and scalarized expected returns (SERs). In this article, we explore the differences between these two criteria using the framework of multi-objective normal-form games (MONFGs). We demonstrate that the choice of optimization criterion (ESR or SER) can radically alter the set of equilibria in a MONFG when nonlinear utility functions are used.

JMLR Journal 2020 Journal Article

AI-Toolbox: A C++ library for Reinforcement Learning and Planning (with Python Bindings)

  • Eugenio Bargiacchi
  • Diederik M. Roijers
  • Ann Nowé

This paper describes AI-Toolbox, a C++ software library that contains reinforcement learning and planning algorithms, and supports both single- and multi-agent problems, as well as partial observability. It is designed for simplicity and clarity, and contains extensive documentation of its API and code. It supports Python to enable users not comfortable with C++ to take advantage of the library's speed and functionality. AI-Toolbox is free software, and is hosted online at https://github.com/Svalorzen/AI-Toolbox.

ECAI Conference 2020 Conference Paper

Fleet Control Using Coregionalized Gaussian Process Policy Iteration

  • Timothy Verstraeten
  • Pieter J. K. Libin
  • Ann Nowé

In many settings, as for example wind farms, multiple machines are instantiated to perform the same task, which is called a fleet. The recent advances with respect to the Internet of Things allow control devices and/or machines to connect through cloud-based architectures in order to share information about their status and environment. Such an infrastructure allows seamless data sharing between fleet members, which could greatly improve the sample-efficiency of reinforcement learning techniques. However, in practice, these machines, while almost identical in design, have small discrepancies due to production errors or degradation, preventing control algorithms from simply aggregating and employing all fleet data. We propose a novel reinforcement learning method that learns to transfer knowledge between similar fleet members and creates member-specific dynamical models for control. Our algorithm uses Gaussian processes to establish cross-member covariances. This is significantly different from standard transfer learning methods, as the focus is not on sharing information over tasks, but rather over system specifications. We demonstrate our approach on two benchmarks and a realistic wind farm setting. Our method significantly outperforms two baseline approaches, namely individual learning and joint learning where all samples are aggregated, in terms of the median and variance of the results.

KER Journal 2020 Journal Article

Toll-based reinforcement learning for efficient equilibria in route choice

  • Gabriel de O. Ramos
  • Bruno C. da Silva
  • Roxana Rădulescu
  • Ana L. C. Bazzan
  • Ann Nowé

The problem of traffic congestion incurs numerous social and economic repercussions and has thus become a central issue in every major city in the world. For this work we look at the transportation domain from a multiagent system perspective, where every driver can be seen as an autonomous decision-making agent. We explore how learning approaches can help achieve an efficient outcome, even when agents interact in a competitive environment for sharing common resources. To this end, we consider the route choice problem, where self-interested drivers need to independently learn which routes minimise their expected travel costs. Such selfish behaviour results in the so-called user equilibrium, which is inefficient from the system’s perspective. In order to mitigate the impact of selfishness, we present Toll-based Q-learning (TQ-learning, for short). TQ-learning employs the idea of marginal-cost tolling (MCT), where each driver is charged according to the cost it imposes on others. The use of MCT leads agents to behave in a socially desirable way such that the system optimum is attainable. In contrast to previous works, however, our tolling scheme is distributed (i.e., each agent can compute its own toll), is charged a posteriori (i.e., at the end of each trip), and is fairer (i.e., agents pay exactly their marginal costs). Additionally, we provide a general formulation of the toll values for univariate, homogeneous polynomial cost functions. We present a theoretical analysis of TQ-learning, proving that it converges to a system-efficient equilibrium (i.e., an equilibrium aligned to the system optimum) in the limit. Furthermore, we perform an extensive empirical evaluation on realistic road networks to support our theoretical findings, showing that TQ-learning indeed converges to the optimum, which translates into a reduction of the congestion levels by 9.1%, on average.
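
A hedged sketch of the two ingredients named above: a stateless Q-learning route chooser and an a-posteriori marginal-cost toll. For a homogeneous polynomial cost c(f) = a·f^p the marginal cost one driver imposes on the others is f·c′(f) = p·c(f); the exact toll formulation in the paper may differ, and the traffic model here is a stand-in.

```python
# Illustration only: Q-learning over route choices with an a-posteriori
# marginal-cost toll. The random per-route flows are a toy traffic model.
import numpy as np

rng = np.random.default_rng(1)
n_routes, a, p = 3, 0.1, 2
Q = np.zeros(n_routes)
alpha, epsilon = 0.1, 0.1

def cost(flow):
    return a * flow ** p                 # homogeneous polynomial travel cost

for episode in range(1000):
    route = rng.integers(n_routes) if rng.random() < epsilon else int(np.argmax(Q))
    flows = rng.poisson(5, n_routes) + 1 + np.arange(n_routes)  # others' load
    travel_cost = cost(flows[route])
    toll = p * cost(flows[route])        # f * c'(f) = p * c(f), charged post-trip
    reward = -(travel_cost + toll)
    Q[route] += alpha * (reward - Q[route])   # stateless (bandit-style) update
```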

KER Journal 2019 Journal Article

Action learning and grounding in simulated human–robot interactions

  • Oliver Roesler
  • Ann Nowé

In order to enable robots to interact with humans in a natural way, they need to be able to autonomously learn new tasks. The most natural way for humans to tell another agent, which can be a human or robot, to perform a task is via natural language. Thus, natural human–robot interactions also require robots to understand natural language, i.e. extract the meaning of words and phrases. To do this, words and phrases need to be linked to their corresponding percepts through grounding. Afterward, agents can learn the optimal micro-action patterns to reach the goal states of the desired tasks. Most previous studies investigated only learning of actions or grounding of words, but not both. Additionally, they often used only a small set of tasks as well as very short and unnaturally simplified utterances. In this paper, we introduce a framework that uses reinforcement learning to learn actions for several tasks and cross-situational learning to ground actions, object shapes and colors, and prepositions. The proposed framework is evaluated through a simulated interaction experiment between a human tutor and a robot. The results show that the employed framework can be used for both action learning and grounding.

ICML Conference 2019 Conference Paper

Dynamic Weights in Multi-Objective Deep Reinforcement Learning

  • Axel Abels
  • Diederik M. Roijers
  • Tom Lenaerts
  • Ann Nowé
  • Denis Steckelmacher

Many real-world decision problems are characterized by multiple conflicting objectives which must be balanced based on their relative importance. In the dynamic weights setting the relative importance changes over time and specialized algorithms that deal with such change, such as a tabular Reinforcement Learning (RL) algorithm by Natarajan and Tadepalli (2005), are required. However, this earlier work is not feasible for RL settings that necessitate the use of function approximators. We generalize across weight changes and high-dimensional inputs by proposing a multi-objective Q-network whose outputs are conditioned on the relative importance of objectives and we introduce Diverse Experience Replay (DER) to counter the inherent non-stationarity of the Dynamic Weights setting. We perform an extensive experimental evaluation and compare our methods to adapted algorithms from Deep Multi-Task/Multi-Objective Reinforcement Learning and show that our proposed network in combination with DER dominates these adapted algorithms across weight change scenarios and problem domains.
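
A sketch of a weight-conditioned multi-objective Q-network in the spirit of the abstract (layer sizes assumed); Diverse Experience Replay is omitted.

```python
# Sizes assumed, not the paper's architecture: the current objective weights
# are part of the input, so one network can generalise across weight changes;
# the head outputs per-objective Q-values that are scalarised by the weights.
import torch
import torch.nn as nn

class ConditionedQ(nn.Module):
    def __init__(self, obs_dim, n_objectives, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + n_objectives, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions * n_objectives))
        self.shape = (n_actions, n_objectives)

    def forward(self, obs, weights):
        q = self.net(torch.cat([obs, weights], dim=-1))
        q = q.view(-1, *self.shape)                # (batch, actions, objectives)
        return (q * weights.unsqueeze(1)).sum(-1)  # scalarised Q per action

net = ConditionedQ(obs_dim=4, n_objectives=2, n_actions=3)
q_values = net(torch.randn(8, 4), torch.tensor([[0.7, 0.3]]).expand(8, 2))
```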

JAAMAS Journal 2019 Journal Article

Multi-objective multi-agent decision making: a utility-based analysis and survey

  • Roxana Rădulescu
  • Patrick Mannion
  • Ann Nowé

Abstract The majority of multi-agent system implementations aim to optimise agents’ policies with respect to a single objective, despite the fact that many real-world problem domains are inherently multi-objective in nature. Multi-objective multi-agent systems (MOMAS) explicitly consider the possible trade-offs between conflicting objective functions. We argue that, in MOMAS, such compromises should be analysed on the basis of the utility that these compromises have for the users of a system. As is standard in multi-objective optimisation, we model the user utility using utility functions that map value or return vectors to scalar values. This approach naturally leads to two different optimisation criteria: expected scalarised returns (ESR) and scalarised expected returns (SER). We develop a new taxonomy which classifies multi-objective multi-agent decision making settings, on the basis of the reward structures, and which and how utility functions are applied. This allows us to offer a structured view of the field, to clearly delineate the current state-of-the-art in multi-objective multi-agent decision making approaches and to identify promising directions for future research. Starting from the execution phase, in which the selected policies are applied and the utility for the users is attained, we analyse which solution concepts apply to the different settings in our taxonomy. Furthermore, we define and discuss these solution concepts under both ESR and SER optimisation criteria. We conclude with a summary of our main findings and a discussion of many promising future research directions in multi-objective multi-agent systems.

ICML Conference 2019 Conference Paper

Per-Decision Option Discounting

  • Anna Harutyunyan
  • Peter Vrancx
  • Philippe Hamel
  • Ann Nowé
  • Doina Precup

In order to solve complex problems an agent must be able to reason over a sufficiently long horizon. Temporal abstraction, commonly modeled through options, offers the ability to reason at many timescales, but the horizon length is still determined by the discount factor of the underlying Markov Decision Process. We propose a modification to the options framework that naturally scales the agent’s horizon with option length. We show that the proposed option-step discount controls a bias-variance trade-off, with larger discounts (counter-intuitively) leading to less estimation variance.
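
One way to read the option-step discount, as a tiny sketch (notation assumed): the bootstrap after an option is scaled by an extra per-option discount on top of the usual per-step one, so longer options effectively extend the agent's horizon.

```python
# Illustrative reading, not the paper's exact formulation: an option-level
# discount applied once per option decision, on top of the per-step gamma.
def option_backup(rewards, gamma, option_discount, next_value):
    """Discounted return of one option execution plus the scaled bootstrap."""
    g = sum(gamma ** t * r for t, r in enumerate(rewards))
    k = len(rewards)                     # option duration
    return g + option_discount * gamma ** k * next_value

print(option_backup([1.0, 0.0, 1.0], gamma=0.99, option_discount=0.9, next_value=5.0))
```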

AAAI Conference 2018 Conference Paper

Adapting to Concept Drift in Credit Card Transaction Data Streams Using Contextual Bandits and Decision Trees

  • Dennis Soemers
  • Tim Brys
  • Kurt Driessens
  • Mark Winands
  • Ann Nowé

Credit card transactions predicted to be fraudulent by automated detection systems are typically handed over to human experts for verification. To limit costs, it is standard practice to select only the most suspicious transactions for investigation. We claim that a trade-off between exploration and exploitation is imperative to enable adaptation to changes in behavior (concept drift). Exploration consists of the selection and investigation of transactions with the purpose of improving predictive models, and exploitation consists of investigating transactions detected to be suspicious. Modeling the detection of fraudulent transactions as rewarding, we use an incremental Regression Tree learner to create clusters of transactions with similar expected rewards. This enables the use of a Contextual Multi-Armed Bandit (CMAB) algorithm to provide the exploration/exploitation trade-off. We introduce a novel variant of a CMAB algorithm that makes use of the structure of this tree, and use Semi-Supervised Learning to grow the tree using unlabeled data. The approach is evaluated on a real dataset and data generated by a simulator that adds concept drift by adapting the behavior of fraudsters to avoid detection. It outperforms frequently used offline models in terms of cumulative rewards, in particular in the presence of concept drift.

EWRL Workshop 2018 Workshop Paper

Directed Policy Gradient for Safe Reinforcement Learning with Human Advice

  • Hélène Plisnier
  • Denis Steckelmacher
  • Tim Brys
  • Diederik Roijers
  • Ann Nowé

Many currently deployed Reinforcement Learning agents work in an environment shared with humans, be they co-workers, users or clients. It is desirable that these agents adjust to people’s preferences, learn faster thanks to their help, and act safely around them. We argue that most current approaches that learn from human feedback are unsafe: rewarding or punishing the agent a posteriori cannot immediately prevent it from wrongdoing. In this paper, we extend Policy Gradient to make it robust to external directives that would otherwise break the fundamentally on-policy nature of Policy Gradient. Our technique, Directed Policy Gradient (DPG), allows a teacher or backup policy to override the agent before it acts undesirably, while allowing the agent to leverage human advice or directives to learn faster. Our experiments demonstrate that DPG makes the agent learn much faster than reward-based approaches, while requiring an order of magnitude less advice.

ICML Conference 2018 Conference Paper

Learning to Coordinate with Coordination Graphs in Repeated Single-Stage Multi-Agent Decision Problems

  • Eugenio Bargiacchi
  • Timothy Verstraeten
  • Diederik M. Roijers
  • Ann Nowé
  • Hado van Hasselt

Learning to coordinate between multiple agents is an important problem in many reinforcement learning problems. Key to learning to coordinate is exploiting loose couplings, i.e., conditional independences between agents. In this paper we study learning in repeated fully cooperative games, multi-agent multi-armed bandits (MAMABs), in which the expected rewards can be expressed as a coordination graph. We propose multi-agent upper confidence exploration (MAUCE), a new algorithm for MAMABs that exploits loose couplings, which enables us to prove a regret bound that is logarithmic in the number of arm pulls and only linear in the number of agents. We empirically compare MAUCE to sparse cooperative Q-learning, and a state-of-the-art combinatorial bandit approach, and show that it performs much better on a variety of settings, including learning control policies for wind farms.
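
A simplified sketch of the loose-coupling idea: local reward estimates and exploration bonuses per coordination-graph edge. MAUCE itself maximises via variable elimination with a tighter bound; the brute-force maximisation and the bonus form below are assumptions.

```python
# Simplified illustration of exploiting loose couplings in a MAMAB: the
# global reward decomposes into local terms over small agent subsets, so
# means and exploration bonuses are tracked per local group.
import itertools, math
import numpy as np

n_agents, n_actions = 3, 2
groups = [(0, 1), (1, 2)]                       # coordination-graph edges
counts = {g: np.ones((n_actions, n_actions)) for g in groups}
means = {g: np.zeros((n_actions, n_actions)) for g in groups}

def ucb_joint_action(t):
    best, best_val = None, -math.inf
    for joint in itertools.product(range(n_actions), repeat=n_agents):
        val = 0.0
        for g in groups:                        # sum of local means + bonuses
            la = tuple(joint[i] for i in g)
            val += means[g][la] + math.sqrt(2 * math.log(t + 1) / counts[g][la])
        if val > best_val:
            best, best_val = joint, val
    return best

print(ucb_joint_action(t=10))
```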

AAAI Conference 2018 Conference Paper

Learning With Options That Terminate Off-Policy

  • Anna Harutyunyan
  • Peter Vrancx
  • Pierre-Luc Bacon
  • Doina Precup
  • Ann Nowé

A temporally abstract action, or an option, is specified by a policy and a termination condition: the policy guides the option behavior, and the termination condition roughly determines its length. Generally, learning with longer options (like learning with multi-step returns) is known to be more efficient. However, if the option set for the task is not ideal, and cannot express the primitive optimal policy well, shorter options offer more flexibility and can yield a better solution. Thus, the termination condition puts learning efficiency at odds with solution quality. We propose to resolve this dilemma by decoupling the behavior and target terminations, just like it is done with policies in off-policy learning. To this end, we give a new algorithm, Q(β), that learns the solution with respect to any termination condition, regardless of how the options actually terminate. We derive Q(β) by casting learning with options into a common framework with well-studied multi-step off-policy learning. We validate our algorithm empirically, and show that it holds up to its motivating claims.

AAAI Conference 2018 Conference Paper

Reinforcement Learning in POMDPs With Memoryless Options and Option-Observation Initiation Sets

  • Denis Steckelmacher
  • Diederik Roijers
  • Anna Harutyunyan
  • Peter Vrancx
  • Hélène Plisnier
  • Ann Nowé

Many real-world reinforcement learning problems have a hierarchical nature, and often exhibit some degree of partial observability. While hierarchy and partial observability are usually tackled separately (for instance by combining recurrent neural networks and options), we show that addressing both problems simultaneously is simpler and more efficient in many cases. More specifically, we make the initiation set of options conditional on the previously-executed option, and show that options with such Option-Observation Initiation Sets (OOIs) are at least as expressive as Finite State Controllers (FSCs), a state-of-the-art approach for learning in POMDPs. OOIs are easy to design based on an intuitive description of the task, lead to explainable policies and keep the top-level and option policies memoryless. Our experiments show that OOIs allow agents to learn optimal policies in challenging POMDPs, while being much more sample-efficient than a recurrent neural network over options.
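
OOIs admit a very direct encoding; the task, option names, and transitions below are made up for illustration.

```python
# Illustrative sketch of Option-Observation Initiation Sets: which options
# may start depends only on the previously executed option, so the top-level
# policy stays memoryless.
OOI = {
    None:         {"go_to_door", "explore"},   # start of episode
    "go_to_door": {"open_door"},
    "open_door":  {"go_to_goal", "explore"},
    "explore":    {"go_to_door", "explore"},
    "go_to_goal": {"go_to_goal"},
}

def available_options(previous_option):
    """The initiation set is conditioned on the previous option only."""
    return OOI[previous_option]

print(available_options("open_door"))
```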

EWRL Workshop 2018 Workshop Paper

Stable, Practical and On-line Bootstrapped Conservative Policy Iteration

  • Denis Steckelmacher
  • Hélène Plisnier
  • Diederik M. Roijers
  • Ann Nowé

We consider on-line model-free reinforcement learning with discrete actions, and focus on sample-efficiency and exploration quality. In this setting, value-based methods of the Q-Learning family achieve state-of-the-art results, while actor-critic algorithms, that learn an explicit actor policy in addition to their value function, do not achieve empirical results matching their theoretical promises. The majority of actor-critic algorithms combine an on-policy critic with an actor learned with Policy Gradient. In this paper, we propose an alternative to these two components. We base our work on Conservative Policy Iteration (CPI), leading to a new non-parametric actor learning rule, and Dual Policy Iteration (DPI), that motivates our use of aggressive off-policy critics. Our empirical results demonstrate a sample-efficiency and robustness superior to state-of-the-art value-based and actor-critic approaches, in three challenging environments.

KER Journal 2016 Journal Article

A reinforcement learning approach to coordinate exploration with limited communication in continuous action games

  • Abdel Rodríguez
  • Peter Vrancx
  • Ricardo Grau
  • Ann Nowé

Learning automata are reinforcement learners belonging to the class of policy iterators. They have already been shown to exhibit nice convergence properties in a wide range of discrete action game settings. Recently, a new formulation for a continuous action reinforcement learning automata (CARLA) was proposed. In this paper, we study the behavior of these CARLA in continuous action games and propose a novel method for coordinated exploration of the joint-action space. Our method allows a team of independent learners, using CARLA, to find the optimal joint action in common interest settings. We first show that independent agents using CARLA will converge to a local optimum of the continuous action game. We then introduce a method for coordinated exploration which allows the team of agents to find the global optimum of the game. We validate our approach in a number of experiments.

KER Journal 2016 Journal Article

Context-sensitive reward shaping for sparse interaction multi-agent systems

  • Yann-Michaël de Hauwere
  • Sam Devlin
  • Daniel Kudenko
  • Ann Nowé

Potential-based reward shaping is a commonly used approach in reinforcement learning to direct exploration based on prior knowledge. Both in single and multi-agent settings this technique speeds up learning without losing any theoretical convergence guarantees. However, if speed-ups through reward shaping are to be achieved in multi-agent environments, a different shaping signal should be used for each context in which agents have a different subgoal or when agents are involved in a different interaction situation. This paper describes the use of context-aware potential functions in a multi-agent system in which the interactions between agents are sparse. This means that, unknown to the agents a priori, the interactions between the agents only occur sporadically in certain regions of the state space. During these interactions, agents need to coordinate in order to reach the global optimal solution. We demonstrate how different reward shaping functions can be used on top of Future Coordinating Q-learning (FCQ-learning); an algorithm capable of automatically detecting when agents should take each other into consideration. Using FCQ-learning, coordination problems can even be anticipated before the actual problems occur, allowing them to be solved in a timely manner. We evaluate our approach on a range of gridworld problems, as well as a simulation of air traffic control.
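
A hedged sketch of context-aware shaping: the shaping reward keeps the standard potential-based form F(s, s′) = γΦ(s′) − Φ(s), with the potential function switched per interaction context; the contexts and potentials below are toy assumptions.

```python
# Standard potential-based shaping, with the potential selected per context
# (the contexts and potentials are illustrative, not the paper's).
def shaping_reward(phi_per_context, context, s, s_next, gamma=0.99):
    phi = phi_per_context[context]
    return gamma * phi(s_next) - phi(s)

# Example: one potential while acting alone, another while coordinating.
phis = {
    "solo":       lambda s: -abs(s - 10),   # progress towards own subgoal
    "coordinate": lambda s: -abs(s - 3),    # progress towards meeting point
}
print(shaping_reward(phis, "coordinate", s=5, s_next=4))
```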

ECAI Conference 2016 Conference Paper

Formalizing Commitment-Based Deals in Boolean Games

  • Sofie De Clercq
  • Steven Schockaert
  • Ann Nowé
  • Martine De Cock

Boolean games (BGs) are a strategic framework in which agents' goals are described using propositional logic. Despite the popularity of BGs, the problem of how agents can coordinate with others to (at least partially) achieve their goals has hardly received any attention. However, negotiation protocols that have been developed outside the setting of BGs can be adopted for this purpose, provided that we can formalize (i) how agents can make commitments and (ii) how deals between coalitions of agents can be identified given a set of active commitments. In this paper, we focus on these two aims. First, we show how agents can formulate commitments that are in accordance with their goals, and what it means for the commitments of an agent to be consistent. Second, we formalize deals in terms of coalitions who can achieve their goals without help from others. We show that verifying the consistency of a set of commitments of one agent is Πᵖ₂-complete, while checking the existence of a deal in a set of mutual commitments is Σᵖ₂-complete.

TAAS Journal 2015 Journal Article

A Reinforcement Learning Approach for Interdomain Routing with Link Prices

  • Peter Vrancx
  • Pasquale Gurzi
  • Abdel Rodriguez
  • Kris Steenhaut
  • Ann Nowé

In today’s Internet, the commercial aspects of routing are gaining importance. Current technology allows Internet Service Providers (ISPs) to renegotiate contracts online to maximize profits. Changing link prices will influence interdomain routing policies that are now driven by monetary aspects as well as global resource and performance optimization. In this article, we consider an interdomain routing game in which the ISP’s action is to set the price for its transit links. Assuming a cheapest path routing scheme, the optimal action is the price setting that yields the highest utility (i.e., profit) and depends both on the network load and the actions of other ISPs. We adapt a continuous and a discrete action learning automaton (LA) to operate in this framework as a tool that can be used by ISP operators to learn optimal price setting. In our model, agents representing different ISPs learn only on the basis of local information and do not need any central coordination or sensitive information exchange. Simulation results show that a single ISP employing LAs is able to learn the optimal price in a stationary environment. By introducing a selective exploration rule, LAs are also able to operate in nonstationary environments. When two ISPs employ LAs, we show that they converge to stable and fair equilibrium strategies.

AAAI Conference 2014 Conference Paper

Combining Multiple Correlated Reward and Shaping Signals by Measuring Confidence

  • Tim Brys
  • Ann Nowé
  • Daniel Kudenko
  • Matthew Taylor

Multi-objective problems with correlated objectives are a class of problems that deserve specific attention. In contrast to typical multi-objective problems, they do not require the identification of trade-offs between the objectives, as (near-) optimal solutions for any objective are (near-) optimal for every objective. Intelligently combining the feedback from these objectives, instead of only looking at a single one, can improve optimization. This class of problems is very relevant in reinforcement learning, as any single-objective reinforcement learning problem can be framed as such a multi-objective problem using multiple reward shaping functions. After discussing this problem class, we propose a solution technique for such reinforcement learning problems, called adaptive objective selection. This technique makes a temporal difference learner estimate the Q-function for each objective in parallel, and introduces a way of measuring confidence in these estimates. This confidence metric is then used to choose which objective’s estimates to use for action selection. We show significant improvements in performance over other plausible techniques on two problem domains. Finally, we provide an intuitive analysis of the technique’s decisions, yielding insights into the nature of the problems being solved.
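
A sketch of adaptive objective selection with a deliberately toy confidence measure (the paper's confidence metric differs): per-objective Q-tables are learned in parallel and, in each state, the most trusted objective picks the action.

```python
# Illustration only: parallel per-objective Q-tables, with a toy visit-count
# stand-in for the paper's confidence measure.
import numpy as np

n_states, n_actions, n_objectives = 10, 4, 3
Q = np.zeros((n_objectives, n_states, n_actions))
visits = np.zeros((n_objectives, n_states, n_actions))

def select_action(state):
    greedy = Q[:, state].argmax(axis=1)                   # per-objective choice
    confidence = visits[np.arange(n_objectives), state, greedy]
    return greedy[confidence.argmax()]                    # most trusted objective

def update(state, action, rewards, next_state, alpha=0.1, gamma=0.95):
    for o in range(n_objectives):                         # parallel TD updates
        target = rewards[o] + gamma * Q[o, next_state].max()
        Q[o, state, action] += alpha * (target - Q[o, state, action])
        visits[o, state, action] += 1
```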

JMLR Journal 2014 Journal Article

Multi-Objective Reinforcement Learning using Sets of Pareto Dominating Policies

  • Kristof van Moffaert
  • Ann Nowé

Many real-world problems involve the optimization of multiple, possibly conflicting objectives. Multi-objective reinforcement learning (MORL) is a generalization of standard reinforcement learning where the scalar reward signal is extended to multiple feedback signals, in essence, one for each objective. MORL is the process of learning policies that optimize multiple criteria simultaneously. In this paper, we present a novel temporal difference learning algorithm that integrates the Pareto dominance relation into a reinforcement learning approach. This algorithm is a multi-policy algorithm that learns a set of Pareto dominating policies in a single run. We name this algorithm Pareto Q-learning and it is applicable in episodic environments with deterministic as well as stochastic transition functions. A crucial aspect of Pareto Q-learning is the updating mechanism that bootstraps sets of Q-vectors. One of our main contributions in this paper is a mechanism that separates the expected immediate reward vector from the set of expected future discounted reward vectors. This decomposition allows us to update the sets and to exploit the learned policies consistently throughout the state space. To balance exploration and exploitation during learning, we also propose three set evaluation mechanisms. These three mechanisms evaluate the sets of vectors to accommodate for standard action selection strategies, such as 𝜖-greedy. More precisely, these mechanisms use multi-objective evaluation principles such as the hypervolume measure, the cardinality indicator and the Pareto dominance relation to select the most promising actions. We experimentally validate the algorithm on multiple environments with two and three objectives and we demonstrate that Pareto Q-learning outperforms current state-of-the-art MORL algorithms with respect to the hypervolume of the obtained policies. We note that (1) Pareto Q-learning is able to learn the entire Pareto front under the usual assumption that each state-action pair is sufficiently sampled, while (2) not being biased by the shape of the Pareto front. Furthermore, (3) the set evaluation mechanisms provide indicative measures for local action selection and (4) the learned policies can be retrieved throughout the state and action space.
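
The set-based flavour of Pareto Q-learning can be illustrated with the nondominance filter at its core; the backup below is a simplification of the paper's update, which separates immediate rewards from the future sets.

```python
# Minimal sketch of the set-based ingredient of Pareto Q-learning: Q-values
# are *sets* of vectors, and updates keep only Pareto nondominated ones.
import numpy as np

def nondominated(vectors):
    vs = np.unique(np.asarray(vectors, dtype=float), axis=0)
    keep = []
    for v in vs:
        dominated = any((w >= v).all() and (w > v).any() for w in vs)
        if not dominated:
            keep.append(v)
    return keep

def backup(reward, next_sets, gamma=0.9):
    """Combine a reward vector with the future vector sets, then filter."""
    candidates = [reward + gamma * q for qs in next_sets for q in qs]
    return nondominated(candidates)

future = [[np.array([1.0, 0.0]), np.array([0.0, 1.0])],   # per next action
          [np.array([0.5, 0.5])]]
print(backup(np.array([0.1, 0.1]), future))
```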

JELIA Conference 2014 Conference Paper

Possibilistic Boolean Games: Strategic Reasoning under Incomplete Information

  • Sofie De Clercq
  • Steven Schockaert
  • Martine De Cock
  • Ann Nowé

Boolean games offer a compact alternative to normal-form games, by encoding the goal of each agent as a propositional formula. In this paper, we show how this framework can be naturally extended to model situations in which agents are uncertain about other agents’ goals. We first use uncertainty measures from possibility theory to semantically define (solution concepts to) Boolean games with incomplete information. Then we present a syntactic characterization of these semantics, which can readily be implemented, and we characterize the computational complexity.

KR Conference 2014 Short Paper

Using Answer Set Programming for Solving Boolean Games

  • Sofie De Clercq
  • Kim Bauters
  • Steven Schockaert
  • Martine De Cock
  • Ann Nowé

Boolean games are a framework for reasoning about the rational behaviour of agents, whose goals are formalized using propositional formulas. They offer an attractive alternative to normal-form games, because they allow for a more intuitive and more compact encoding. Unfortunately, there is currently no general, tailor-made method available to compute the equilibria of Boolean games. In this paper, we introduce a method for finding the pure Nash equilibria based on disjunctive answer set programming. Our method is furthermore capable of finding the core elements and the Pareto optimal equilibria, and can easily be modified to support other forms of optimality, thanks to the declarative nature of disjunctive answer set programming. Experimental results clearly demonstrate the effectiveness of the proposed method.
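
To make the solution concept concrete, the following brute force finds the pure Nash equilibria of a small Boolean game. The paper's actual contribution is an encoding into disjunctive answer set programming; this naive enumeration scales exponentially and only illustrates what is being computed:

```python
from itertools import product

# Brute-force pure Nash equilibria of a Boolean game, for illustration only.

def assignments(vs):
    """All truth assignments over the variables vs."""
    return [dict(zip(vs, bits)) for bits in product([False, True], repeat=len(vs))]

def pure_nash(agents):
    """agents maps a name to (controlled variables, goal predicate)."""
    names = list(agents)
    equilibria = []
    for profile in product(*(assignments(agents[n][0]) for n in names)):
        valuation = {k: v for part in profile for k, v in part.items()}
        stable = True
        for name in names:
            vars_n, goal_n = agents[name]
            utility = goal_n(valuation)
            for deviation in assignments(vars_n):
                alt = {**valuation, **deviation}
                if goal_n(alt) > utility:     # profitable unilateral deviation
                    stable = False
                    break
            if not stable:
                break
        if stable:
            equilibria.append(valuation)
    return equilibria

# Toy coordination game: both agents are satisfied iff p and q agree.
game = {
    "a": (["p"], lambda v: int(v["p"] == v["q"])),
    "b": (["q"], lambda v: int(v["p"] == v["q"])),
}
print(pure_nash(game))   # the all-False and all-True valuations are equilibria
```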

ECAI Conference 2014 Conference Paper

Using Ensemble Techniques and Multi-Objectivization to Solve Reinforcement Learning Problems

  • Tim Brys
  • Matthew E. Taylor
  • Ann Nowé

Recent work on multi-objectivization has shown how a single-objective reinforcement learning problem can be turned into a multi-objective problem with correlated objectives, by providing multiple reward shaping functions. The information contained in these correlated objectives can be exploited to solve the base, single-objective problem faster and better, given techniques specifically aimed at handling such correlated objectives. In this paper, we identify ensemble techniques as a set of methods that is suitable to solve multi-objectivized reinforcement learning problems. We empirically demonstrate their use on the Pursuit domain.
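
As one concrete instance of such an ensemble, the sketch below combines per-objective Q-tables by rank voting; the paper evaluates several combination strategies, and this particular formulation is ours, chosen for illustration:

```python
import numpy as np

# One plausible ensemble rule for multi-objectivized RL: each shaped objective
# ranks the actions by its own Q-values, ranks are summed across objectives,
# and the action with the highest total rank is played.

def rank_vote(q_tables, state):
    """q_tables has shape (n_objectives, n_states, n_actions)."""
    scores = np.zeros(q_tables.shape[2])
    for q in q_tables[:, state, :]:
        scores += np.argsort(np.argsort(q))   # per-objective rank, 0 = worst
    return int(np.argmax(scores))

q = np.array([[[0.1, 0.9, 0.5]],    # objective 1 prefers action 1
              [[0.9, 0.8, 0.2]]])   # objective 2 prefers action 0
print(rank_vote(q, state=0))        # action 1 wins the combined rank vote
```

Because each correlated objective points (noisily) toward the same optimal behaviour, aggregating their preferences in this way can outvote the noise in any single shaped signal.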

JAAMAS Journal 2013 Journal Article

A decentralized approach for convention emergence in multi-agent systems

  • Mihail Mihaylov
  • Karl Tuyls
  • Ann Nowé

The field of convention emergence studies how agents involved in repeated coordination games can reach consensus through only local interactions. The literature on this topic is vast and is motivated by human societies, mainly addressing coordination problems between human agents, such as who gets to redial after a dropped telephone call. In contrast, real-world engineering problems, such as coordination in wireless sensor networks, involve agents with limited resources and knowledge and thus pose certain restrictions on the complexity of the coordination mechanisms. Due to these restrictions, strategies proposed for human coordination may not be suitable for engineering applications and need to be further explored in the context of real-world application domains. In this article we take the role of designers of large decentralized multi-agent systems. We investigate the factors that speed up the convergence process of agents arranged in different static and dynamic topologies and under different interaction models, typical for engineering applications. We also study coordination problems both under partial observability and in the presence of faults (or noise). The main contributions of this article are that we propose an approach for emergent coordination, motivated by highly constrained devices, such as wireless nodes and swarm bots, in the absence of a central entity and perform extensive theoretical and empirical studies. Our approach is called Win-Stay Lose-probabilistic-Shift, generalizing two well-known strategies in game theory that have been applied in other domains. We demonstrate that our approach performs well in different settings under limited information and imposes minimal system requirements, due to its simplicity. Moreover, our technique outperforms state-of-the-art coordination mechanisms, guarantees full convergence in any topology and has the property that all convention states are absorbing.
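
A minimal simulation conveys the flavour of Win-Stay Lose-probabilistic-Shift: after a failed coordination an agent switches actions only with some probability, and after a success it stays. The well-mixed pairing and parameter values below are illustrative choices, not the article's exact experimental setup:

```python
import random

# Minimal simulation of Win-Stay Lose-probabilistic-Shift for convention
# emergence in a pure coordination setting.

def wslps(n_agents=50, n_actions=2, p_shift=0.5, max_rounds=100_000, seed=0):
    rng = random.Random(seed)
    actions = [rng.randrange(n_actions) for _ in range(n_agents)]
    for t in range(max_rounds):
        i, j = rng.sample(range(n_agents), 2)
        if actions[i] != actions[j]:           # "lose": coordination failed
            for k in (i, j):
                if rng.random() < p_shift:     # shift with probability p_shift
                    others = [a for a in range(n_actions) if a != actions[k]]
                    actions[k] = rng.choice(others)
        # on a "win" (matching actions) both agents simply stay
        if len(set(actions)) == 1:
            return t + 1                       # rounds until a convention emerged
    return None                                # no convention within the budget

print(wslps())
```

The simplicity is the point: the rule needs no memory beyond the last interaction and no central entity, which is what makes it suitable for the constrained devices the article targets.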

EUMAS Conference 2011 Conference Paper

Local Coordination in Online Distributed Constraint Optimization Problems

  • Tim Brys
  • Yann-Michaël De Hauwere
  • Ann Nowé
  • Peter Vrancx

In cooperative multi-agent systems, group performance often depends more on the interactions between team members than on the performance of any individual agent. Hence, coordination among agents is essential to optimize the group strategy. One solution which is common in the literature is to let the agents learn in a joint action space. Joint Action Learning (JAL) enables agents to explicitly take into account the actions of other agents, but has the significant drawback that the action space in which the agents must learn scales exponentially in the number of agents. Local coordination is a way for a team to coordinate while keeping communication and computational complexity low. It allows the exploitation of a specific dependency structure underlying the problem, such as tight couplings between specific agents. In this paper we investigate a novel approach to local coordination, in which agents learn this dependency structure, resulting in coordination which is beneficial to the group performance. We evaluate our approach in the context of online distributed constraint optimization problems.
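
The scaling argument against joint action learning is easy to quantify; the back-of-envelope comparison below contrasts one table over all joint actions with per-edge tables over a sparse dependency structure (the numbers are ours, chosen only for illustration):

```python
# A joint action learner needs one value per joint action, while local
# coordination over a sparse dependency structure only needs small
# pairwise tables, one per coupled pair of agents.

def jal_table_size(n_agents, n_actions):
    return n_actions ** n_agents            # one entry per joint action

def local_table_size(n_edges, n_actions):
    return n_edges * n_actions ** 2         # one pairwise table per dependency

print(jal_table_size(10, 4))                # 1048576 joint actions
print(local_table_size(9, 4))               # 144 entries for a tree of 10 agents
```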

JAAMAS Journal 2006 Journal Article

Exploring selfish reinforcement learning in repeated games with stochastic rewards

  • Katja Verbeeck
  • Ann Nowé
  • Karl Tuyls

In this paper we introduce a new multi-agent reinforcement learning algorithm, called exploring selfish reinforcement learning (ESRL). ESRL allows agents to reach optimal solutions in repeated non-zero-sum games with stochastic rewards, by using coordinated exploration. First, two ESRL algorithms for respectively common interest and conflicting interest games are presented. Both ESRL algorithms are based on the same idea, i.e. an agent explores by temporarily excluding some of the local actions from its private action space, to give the team of agents the opportunity to look for better solutions in a reduced joint action space. In a later stage these two algorithms are transformed into one generic algorithm which does not assume that the type of the game is known in advance. ESRL is able to find the Pareto optimal solution in common interest games without communication. In conflicting interest games ESRL only needs limited communication to learn a fair periodical policy, resulting in a good overall policy. Importantly, ESRL agents are independent in the sense that they base their decisions only on their own action choices and rewards; they are flexible in learning different solution concepts; and they can handle stochastic, possibly delayed rewards as well as asynchronous action selection. A real-life experiment on adaptive load-balancing of parallel applications is included.
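
The exclusion idea can be sketched as follows for a two-agent common interest game. This is a heavy simplification: the actual ESRL algorithms run learning automata with convergence guarantees in each exploration phase, whereas here that phase is replaced by crude random sampling purely to expose the action-exclusion mechanism:

```python
import random

# Heavily simplified sketch of ESRL-style coordinated exploration.

def esrl_sketch(payoff, n_actions, phases=3, steps=500, seed=0):
    rng = random.Random(seed)
    spaces = [set(range(n_actions)), set(range(n_actions))]  # private action spaces
    best_joint, best_value = None, float("-inf")
    for _ in range(phases):
        # Stand-in "convergence" phase: sample joints from the reduced spaces.
        seen = {}
        for _ in range(steps):
            joint = tuple(rng.choice(sorted(s)) for s in spaces)
            seen[joint] = payoff[joint]      # stochastic rewards would be averaged
        joint, value = max(seen.items(), key=lambda kv: kv[1])
        if value > best_value:
            best_joint, best_value = joint, value
        # Exclusion step: each agent temporarily removes its converged action,
        # so the next phase explores a reduced joint action space.
        for space, a in zip(spaces, joint):
            if len(space) > 1:
                space.discard(a)
    return best_joint, best_value

payoff = {(i, j): [[3, 0], [0, 2]][i][j] for i in range(2) for j in range(2)}
print(esrl_sketch(payoff, n_actions=2))      # finds the (0, 0) optimum, value 3
```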

KER Journal 2005 Journal Article

Evolutionary game theory and multi-agent reinforcement learning

  • Karl Tuyls
  • Ann Nowé

In this paper we survey the basics of reinforcement learning and (evolutionary) game theory, applied to the field of multi-agent systems. The paper consists of three parts. We start with an overview of the fundamentals of reinforcement learning. Next we summarize the most important aspects of evolutionary game theory. Finally, we discuss the state of the art of multi-agent reinforcement learning and its mathematical connection with evolutionary game theory.
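
The mathematical connection referred to here links the dynamics of simple reinforcement learners to the replicator dynamics of evolutionary game theory. The sketch below just integrates the two-population replicator equations for a 2x2 game; the payoff matrix is an illustrative choice, not taken from the paper:

```python
import numpy as np

# Euler integration of the two-population replicator dynamics for a 2x2 game
# (here a Prisoner's Dilemma), sketching the evolutionary side of the
# connection surveyed in the paper.

A = np.array([[3.0, 0.0],    # row player's payoffs
              [5.0, 1.0]])
B = A.T                      # symmetric game: column player's payoffs

def replicator_step(x, y, dt=0.01):
    """One Euler step: strategy shares grow with excess fitness."""
    fx = A @ y                             # row strategies' fitness against y
    fy = B @ x                             # column strategies' fitness against x
    x = x + dt * x * (fx - x @ fx)
    y = y + dt * y * (fy - y @ fy)
    return x / x.sum(), y / y.sum()

x = y = np.array([0.5, 0.5])
for _ in range(2000):
    x, y = replicator_step(x, y)
print(x, y)   # both populations drift toward the dominant (defect) strategy
```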