Arrow Research search

Author name cluster

Sam Devlin

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

27 papers
2 author rows

Possible papers (27)

ICML Conference 2025 Conference Paper

Scaling Laws for Pre-training Agents and World Models

  • Tim Pearce
  • Tabish Rashid
  • David Bignell
  • Raluca Georgescu
  • Sam Devlin
  • Katja Hofmann

The performance of embodied agents has been shown to improve by increasing model parameters, dataset size, and compute. This has been demonstrated in domains from robotics to video games, when generative learning objectives on offline datasets (pre-training) are used to model an agent's behavior (imitation learning) or their environment (world modeling). This paper characterizes the role of scale in these tasks more precisely. Going beyond the simple intuition that 'bigger is better', we show that the same types of power laws found in language modeling also arise in world modeling and imitation learning (e.g. between loss and optimal model size). However, the coefficients of these laws are heavily influenced by the tokenizer, task, and architecture; this has important implications for the optimal sizing of models and data.
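The headline relationship is a power law of the form L(N) = (Nc / N)^alpha between loss and optimal model size. As a minimal sketch (not the paper's code, and with made-up sweep numbers), fitting such a law reduces to linear regression in log-log space:

```python
import numpy as np

# Hypothetical (model size, final loss) pairs from a pre-training sweep.
model_sizes = np.array([1e6, 3e6, 1e7, 3e7, 1e8])
losses      = np.array([3.10, 2.71, 2.34, 2.05, 1.78])

# Fit L(N) = (Nc / N)^alpha via log-log linear regression:
# log L = alpha * log Nc - alpha * log N
slope, intercept = np.polyfit(np.log(model_sizes), np.log(losses), 1)
alpha = -slope
Nc = np.exp(intercept / alpha)
print(f"alpha = {alpha:.3f}, Nc = {Nc:.3e}")
```

The paper's point is that the fitted coefficients (alpha, Nc) shift substantially with the tokenizer, task, and architecture, which changes where the compute-optimal model size lands.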

ICLR Conference 2023 Conference Paper

Contrastive Meta-Learning for Partially Observable Few-Shot Learning

  • Adam Jelley
  • Amos J. Storkey
  • Antreas Antoniou
  • Sam Devlin

Many contrastive and meta-learning approaches learn representations by identifying common features in multiple views. However, the formalism for these approaches generally assumes features to be shared across views to be captured coherently. We consider the problem of learning a unified representation from partial observations, where useful features may be present in only some of the views. We approach this through a probabilistic formalism enabling views to map to representations with different levels of uncertainty in different components; these views can then be integrated with one another through marginalisation over that uncertainty. Our approach, Partial Observation Experts Modelling (POEM), then enables us to meta-learn consistent representations from partial observations. We evaluate our approach on an adaptation of a comprehensive few-shot learning benchmark, Meta-Dataset, and demonstrate the benefits of POEM over other meta-learning methods at representation learning from partial observations. We further demonstrate the utility of POEM by meta-learning to represent an environment from partial views observed by an agent exploring the environment.
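The central mechanism is mapping each view to a representation with per-component uncertainty and integrating views by marginalising over that uncertainty. A minimal numpy sketch of the flavour of this idea, using a standard precision-weighted product of Gaussians (the fusion rule, shapes, and numbers are illustrative assumptions, not POEM's exact model):

```python
import numpy as np

def fuse_views(means, log_vars):
    """Precision-weighted product of per-view Gaussian beliefs over a
    representation. A component a view did not observe carries high
    variance, so it contributes near-zero precision to the fusion."""
    precisions = np.exp(-np.asarray(log_vars))   # 1 / sigma^2 per view
    total_prec = precisions.sum(axis=0)
    fused_mean = (precisions * np.asarray(means)).sum(axis=0) / total_prec
    return fused_mean, 1.0 / total_prec          # fused mean and variance

# Two partial views of a 3-d representation: each is confident about a
# different component (low variance) and vague elsewhere (high variance).
means    = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
log_vars = [[-4.0, 6.0, 6.0], [6.0, -4.0, 6.0]]
print(fuse_views(means, log_vars))
```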

ICLR Conference 2023 Conference Paper

Imitating Human Behaviour with Diffusion Models

  • Tim Pearce
  • Tabish Rashid
  • Anssi Kanervisto
  • David Bignell
  • Mingfei Sun 0001
  • Raluca Georgescu
  • Sergio Valcarcel Macua
  • Shan Zheng Tan

Diffusion models have emerged as powerful generative models in the text-to-image domain. This paper studies their application as observation-to-action models for imitating human behaviour in sequential environments. Human behaviour is stochastic and multimodal, with structured correlations between action dimensions. Meanwhile, standard modelling choices in behaviour cloning are limited in their expressiveness and may introduce bias into the cloned policy. We begin by pointing out the limitations of these choices. We then propose that diffusion models are an excellent fit for imitating human behaviour, since they learn an expressive distribution over the joint action space. We introduce several innovations to make diffusion models suitable for sequential environments: designing suitable architectures, investigating the role of guidance, and developing reliable sampling strategies. Experimentally, diffusion models closely match human demonstrations in a simulated robotic control task and a modern 3D gaming environment.
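At test time, an observation-to-action diffusion model produces an action by iteratively denoising a Gaussian sample conditioned on the current observation. A generic DDPM-style sketch of that sampling loop (the placeholder eps_model, step count, and constant noise schedule are assumptions, not the paper's architecture or samplers):

```python
import numpy as np

rng = np.random.default_rng(0)

def eps_model(action, obs, t):
    # Placeholder for a trained conditional noise-prediction network
    # eps_theta(a_t, obs, t); here just a dummy linear function.
    return 0.1 * action

def sample_action(obs, steps=50, beta=0.02, act_dim=2):
    """DDPM-style reverse diffusion over the action space, conditioned
    on the current observation."""
    alphas = (1.0 - beta) * np.ones(steps)
    alpha_bars = np.cumprod(alphas)
    a = rng.standard_normal(act_dim)              # start from pure noise
    for t in reversed(range(steps)):
        eps = eps_model(a, obs, t)
        a = (a - beta / np.sqrt(1 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            a += np.sqrt(beta) * rng.standard_normal(act_dim)
    return a

print(sample_action(obs=np.zeros(4)))
```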

AAMAS Conference 2023 Conference Paper

Trust Region Bounds for Decentralized PPO Under Non-stationarity

  • Mingfei Sun
  • Sam Devlin
  • Jacob Beck
  • Katja Hofmann
  • Shimon Whiteson

We present trust region bounds for optimizing decentralized policies in cooperative Multi-Agent Reinforcement Learning (MARL), which hold even when the transition dynamics are non-stationary. This new analysis provides a theoretical understanding of the strong performance of two recent actor-critic methods for MARL, which both rely on independent ratios, i.e., computing probability ratios separately for each agent's policy. We show that, despite the non-stationarity that independent ratios cause, a monotonic improvement guarantee still arises as a result of enforcing the trust region constraint over all decentralized policies. We also show this trust region constraint can be effectively enforced in a principled way by bounding independent ratios based on the number of agents in training, providing a theoretical foundation for proximal ratio clipping. Finally, our empirical results support the hypothesis that the strong performance of IPPO and MAPPO is a direct result of enforcing such a trust region constraint via clipping in centralized training, and tuning the hyperparameters with respect to the number of agents, as predicted by our theoretical analysis.
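In practice this suggests per-agent ('independent') PPO-style ratio clipping, with the clip range tightened as the number of agents grows. A small numpy sketch of the clipped surrogate (the schedule eps_base / n_agents is an illustrative reading of 'tuning with respect to the number of agents', not the paper's exact bound):

```python
import numpy as np

def clipped_objective(logp_new, logp_old, advantages, n_agents, eps_base=0.2):
    """PPO clipped surrogate with independent per-agent ratios; the clip
    range shrinks with the number of agents to keep the joint update
    inside a trust region."""
    eps = eps_base / n_agents
    ratios = np.exp(logp_new - logp_old)          # computed per agent
    clipped = np.clip(ratios, 1 - eps, 1 + eps)
    return np.minimum(ratios * advantages, clipped * advantages).mean()

print(clipped_objective(np.array([-1.0]), np.array([-1.1]),
                        np.array([0.5]), n_agents=4))
```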

AAAI Conference 2022 Conference Paper

Deterministic and Discriminative Imitation (D2-Imitation): Revisiting Adversarial Imitation for Sample Efficiency

  • Mingfei Sun
  • Sam Devlin
  • Katja Hofmann
  • Shimon Whiteson

Sample efficiency is crucial for imitation learning methods to be applicable in real-world applications. Many studies improve sample efficiency by extending adversarial imitation to be off-policy, even though these off-policy extensions could either change the original objective or involve complicated optimization. We revisit the foundation of adversarial imitation and propose an off-policy, sample-efficient approach that requires no adversarial training or min-max optimization. Our formulation capitalizes on two key insights: (1) the similarity between the Bellman equation and the stationary state-action distribution equation allows us to derive a novel temporal difference (TD) learning approach; and (2) the use of a deterministic policy simplifies the TD learning. Combined, these insights yield a practical algorithm, Deterministic and Discriminative Imitation (D2-Imitation), which operates by first partitioning samples into two replay buffers and then learning a deterministic policy via off-policy reinforcement learning. Our empirical results show that D2-Imitation is effective in achieving good sample efficiency, outperforming several off-policy extension approaches of adversarial imitation on many control tasks.
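Operationally, the two replay buffers are populated by a discriminator's judgment of how expert-like each transition is, after which a deterministic policy is trained off-policy against them. A hedged sketch of the partitioning step (the threshold and the +1/0 reward labels are an illustrative simplification of the paper's construction):

```python
def partition(transitions, discriminator, threshold=0.5):
    """Split transitions into an expert-like buffer (positive reward)
    and a remainder buffer (zero reward) based on a discriminator score."""
    expert_like, other = [], []
    for s, a, s_next in transitions:
        if discriminator(s, a) > threshold:
            expert_like.append((s, a, 1.0, s_next))
        else:
            other.append((s, a, 0.0, s_next))
    return expert_like, other

demo = [((0,), 1, (1,)), ((1,), 0, (2,))]
disc = lambda s, a: 0.9 if a == 1 else 0.2      # stand-in discriminator
print(partition(demo, disc))
```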

NeurIPS Conference 2022 Conference Paper

Uni[MASK]: Unified Inference in Sequential Decision Problems

  • Micah Carroll
  • Orr Paradise
  • Jessy Lin
  • Raluca Georgescu
  • Mingfei Sun
  • David Bignell
  • Stephanie Milani
  • Katja Hofmann

Randomly masking and predicting word tokens has been a successful approach in pre-training language models for a variety of downstream tasks. In this work, we observe that the same idea also applies naturally to sequential decision making, where many well-studied tasks like behavior cloning, offline RL, inverse dynamics, and waypoint conditioning correspond to different sequence maskings over a sequence of states, actions, and returns. We introduce the UniMASK framework, which provides a unified way to specify models which can be trained on many different sequential decision making tasks. We show that a single UniMASK model is often capable of carrying out many tasks with performance similar to or better than single-task models. Additionally, after fine-tuning, our UniMASK models consistently outperform comparable single-task models.
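The unifying idea is that each task is just a choice of which tokens in a (state, action) sequence are given as input and which must be predicted. A small sketch of mask construction in that spirit (task names and mask layouts are an illustrative simplification of the paper's scheme):

```python
import numpy as np

def make_mask(T, task):
    """Visibility masks over a length-T (state, action) trajectory.
    1 = token given as input, 0 = token to be predicted."""
    states, actions = np.ones(T), np.ones(T)
    if task == "behavior_cloning":      # states in, actions out
        actions[:] = 0
    elif task == "forward_dynamics":    # first state + actions in, future states out
        states[1:] = 0
    elif task == "waypoint":            # start and goal states in, actions out
        states[1:-1] = 0
        actions[:] = 0
    return np.stack([states, actions])  # rows: [states, actions]

print(make_mask(5, "waypoint"))
```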

AAMAS Conference 2021 Conference Paper

Deep Interactive Bayesian Reinforcement Learning via Meta-Learning

  • Luisa Zintgraf
  • Sam Devlin
  • Kamil Ciosek
  • Shimon Whiteson
  • Katja Hofmann

Agents that interact with other agents often do not know a priori what the other agents' strategies are, but have to maximise their own online return while interacting with and learning about others. The optimal adaptive behaviour under uncertainty over the other agents' strategies w.r.t. some prior can in principle be computed using the Interactive Bayesian Reinforcement Learning framework. Unfortunately, doing so is intractable in most settings, and existing approximation methods are restricted to small tasks. To overcome this, we propose to meta-learn (alongside the policy) approximate belief inference by combining sequential and hierarchical VAEs. We show empirically that our approach can learn a factorised belief model that separates the other agent's permanent and temporal structure, and outperforms methods that sample from the approximate posterior or do not have this hierarchical structure. A full version of this work can be found in Zintgraf et al. [30].

AAMAS Conference 2021 Conference Paper

Difference Rewards Policy Gradients

  • Jacopo Castellini
  • Sam Devlin
  • Frans A. Oliehoek
  • Rahul Savani

Policy gradient methods have become one of the most popular classes of algorithms for multi-agent reinforcement learning. A key challenge, however, that is not addressed by many of these methods is multi-agent credit assignment: assessing an agent's contribution to the overall performance, which is crucial for learning good policies. We propose a novel algorithm called Dr. Reinforce that explicitly tackles this by combining difference rewards with policy gradients to allow for learning decentralized policies when the reward function is known. By differencing the reward function directly, Dr. Reinforce avoids difficulties associated with learning the Q-function as done by Counterfactual Multiagent Policy Gradients (COMA), a state-of-the-art difference rewards method. For applications where the reward function is unknown, we show the effectiveness of a version of Dr. Reinforce that learns a reward network that is used to estimate the difference rewards.
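For reference, the difference reward compares the global reward with a counterfactual in which agent i's action is replaced by a default action; Dr. Reinforce then uses this quantity in place of the global reward in the policy gradient. A sketch of both (the gradient form is our plausible reading of the abstract, not a quotation from the paper):

```latex
% Difference reward for agent i, with default action c_i:
D_i(s, a) = G(s, a) - G\big(s, (a_{-i}, c_i)\big)

% Dr.Reinforce-style gradient: REINFORCE with D_i replacing G.
\nabla_{\theta_i} J \approx \mathbb{E}\left[ D_i(s, a)\,
    \nabla_{\theta_i} \log \pi_{\theta_i}(a_i \mid s) \right]
```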

AAMAS Conference 2021 Conference Paper

Evaluating the Robustness of Collaborative Agents

  • Paul Knott
  • Micah Carroll
  • Sam Devlin
  • Kamil Ciosek
  • Katja Hofmann
  • Anca Dragan
  • Rohin Shah

Artificial agents trained by deep reinforcement learning will likely encounter novel situations after deployment that were never seen during training. Agents must be robust enough to handle such situations well. However, if we cannot rely on the average training or validation reward as a metric, then how can we effectively evaluate robustness? We take inspiration from the practice of unit testing in software engineering. Specifically, we suggest that when designing AI agents that collaborate with humans, designers should search for potential edge cases in possible partner behavior and possible states encountered, and write tests which check that the behavior of the agent in these edge cases is reasonable. We apply this methodology to build a suite of unit tests for the Overcooked-AI environment, and use this test suite to evaluate three proposals for improving robustness. We find that the test suite provides significant insight into the effects of these proposals that were generally not revealed by looking solely at the average validation reward. For our full paper, see arxiv.org/abs/2101.05507.
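The methodology maps directly onto ordinary test code: enumerate edge cases in partner behaviour, then assert the trained agent still behaves reasonably. A toy, self-contained sketch using Python's unittest (the rollout, agent, and partner stand-ins are hypothetical, not the Overcooked-AI API):

```python
import unittest

def rollout(agent, partner, steps=50):
    # Toy stand-in for an environment rollout: the agent scores a point
    # whenever it keeps working while the partner idles.
    return sum(agent(t) for t in range(steps) if partner(t) == "idle")

class TestPartnerEdgeCases(unittest.TestCase):
    def test_progress_when_partner_idles(self):
        """The agent should make progress alone rather than deadlock."""
        idle_partner = lambda t: "idle"
        agent = lambda t: 1                 # always keeps working
        self.assertGreater(rollout(agent, idle_partner), 0)

if __name__ == "__main__":
    unittest.main()
```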

ICML Conference 2021 Conference Paper

Navigation Turing Test (NTT): Learning to Evaluate Human-Like Navigation

  • Sam Devlin
  • Raluca Georgescu
  • Ida Momennejad
  • Jaroslaw Rzepecki
  • Evelyn Zuniga
  • Gavin Costello
  • Guy Leroy
  • Ali Shaw

A key challenge on the path to developing agents that learn complex human-like behavior is the need to quickly and accurately quantify human-likeness. While human assessments of such behavior can be highly accurate, speed and scalability are limited. We address these limitations through a novel automated Navigation Turing Test (ANTT) that learns to predict human judgments of human-likeness. We demonstrate the effectiveness of our automated NTT on a navigation task in a complex 3D environment. We investigate six classification models to shed light on the types of architectures best suited to this task, and validate them against data collected through a human NTT. Our best models achieve high accuracy when distinguishing true human and agent behavior. At the same time, we show that predicting finer-grained human assessment of agents’ progress towards human-like behavior remains unsolved. Our work takes an important step towards agents that more effectively learn complex human-like behavior.

UAI Conference 2021 Conference Paper

Strategically efficient exploration in competitive multi-agent reinforcement learning

  • Robert Tyler Loftin
  • Aadirupa Saha
  • Sam Devlin
  • Katja Hofmann

High sample complexity remains a barrier to the application of reinforcement learning (RL), particularly in multi-agent systems. A large body of work has demonstrated that exploration mechanisms based on the principle of optimism under uncertainty can significantly improve the sample efficiency of RL in single agent tasks. This work seeks to understand the role of optimistic exploration in non-cooperative multi-agent settings. We show that, in zero-sum games, optimistic exploration can cause the learner to waste time sampling parts of the state space that are irrelevant to strategic play, as they can only be reached through cooperation between both players. To address this issue, we introduce a formal notion of strategically efficient exploration in Markov games, and use this to develop two strategically efficient learning algorithms for finite Markov games. We demonstrate that these methods can be significantly more sample efficient than their optimistic counterparts.

ICLR Conference 2020 Conference Paper

AMRL: Aggregated Memory For Reinforcement Learning

  • Jacob Beck
  • Kamil Ciosek
  • Sam Devlin
  • Sebastian Tschiatschek
  • Cheng Zhang 0005
  • Katja Hofmann

In many partially observable scenarios, Reinforcement Learning (RL) agents must rely on long-term memory in order to learn an optimal policy. We demonstrate that using techniques from NLP and supervised learning fails at RL tasks due to stochasticity from the environment and from exploration. Utilizing our insights on the limitations of traditional memory methods in RL, we propose AMRL, a class of models that can learn better policies with greater sample efficiency and are resilient to noisy inputs. Specifically, our models use a standard memory module to summarize short-term context, and then aggregate all prior states from the standard model without respect to order. We show that this provides advantages both in terms of gradient decay and signal-to-noise ratio over time. Evaluating in Minecraft and maze environments that test long-term memory, we find that our model improves average return by 19% over a baseline that has the same number of parameters and by 9% over a stronger baseline that has far more parameters.
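The architectural recipe is to keep a standard recurrent summary for short-term context while aggregating all prior per-step features with an order-invariant operator such as max or average. A forward-pass-only numpy sketch (AMRL also uses a straight-through gradient trick that this sketch omits; shapes and the choice of aggregators are illustrative):

```python
import numpy as np

def amrl_style_summary(features):
    """Combine the most recent short-term context with order-invariant
    aggregates (max and average) over all prior per-step features."""
    features = np.asarray(features)       # (T, d) per-step encodings
    max_agg = features.max(axis=0)        # order-invariant, noise-resilient
    avg_agg = features.mean(axis=0)
    recent = features[-1]                 # stand-in for the LSTM output
    return np.concatenate([recent, max_agg, avg_agg])

print(amrl_style_summary(np.random.randn(10, 4)).shape)   # (12,)
```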

NeurIPS Conference 2019 Conference Paper

Generalization in Reinforcement Learning with Selective Noise Injection and Information Bottleneck

  • Maximilian Igl
  • Kamil Ciosek
  • Yingzhen Li
  • Sebastian Tschiatschek
  • Cheng Zhang
  • Sam Devlin
  • Katja Hofmann

The ability of policies to generalize to new environments is key to the broad application of RL agents. A promising approach to prevent an agent's policy from overfitting to a limited set of training environments is to apply regularization techniques originally developed for supervised learning. However, there are stark differences between supervised learning and RL. We discuss those differences and propose modifications to existing regularization techniques in order to better adapt them to RL. In particular, we focus on regularization techniques relying on the injection of noise into the learned function, a family that includes some of the most widely used approaches such as Dropout and Batch Normalization. To adapt them to RL, we propose Selective Noise Injection (SNI), which maintains the regularizing effect the injected noise has, while mitigating the adverse effects it has on the gradient quality. Furthermore, we demonstrate that the Information Bottleneck (IB) is a particularly well suited regularization technique for RL as it is effective in the low-data regime encountered early on in training RL agents. Combining the IB with SNI, we significantly outperform current state-of-the-art results, including on the recently proposed generalization benchmark Coinrun.
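The crux of SNI is that the noise should regularize the gradient without corrupting what the agent acts on, e.g. by acting noise-free while mixing a noisy and a noise-free loss term. A loose PyTorch sketch with dropout as the noise source (the mixing weight, supervised loss, and network are illustrative assumptions, not the paper's PPO formulation):

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(),
                    nn.Dropout(0.2), nn.Linear(64, 4))

def act(obs):
    """Rollouts use the noise-free network (dropout disabled)."""
    net.eval()
    with torch.no_grad():
        return net(obs).argmax(-1)

def sni_loss(obs, target, lam=0.5):
    """Mix gradients from a noisy and a noise-free forward pass."""
    net.train()                                          # dropout active
    noisy = nn.functional.cross_entropy(net(obs), target)
    net.eval()                                           # dropout off; grads still flow
    clean = nn.functional.cross_entropy(net(obs), target)
    return lam * clean + (1 - lam) * noisy

obs, target = torch.randn(16, 8), torch.randint(0, 4, (16,))
print(sni_loss(obs, target).item())
```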

KER Journal 2018 Journal Article

Reward shaping for knowledge-based multi-objective multi-agent reinforcement learning

  • Patrick Mannion
  • Sam Devlin
  • Jim Duggan
  • Enda Howley

The majority of multi-agent reinforcement learning (MARL) implementations aim to optimize systems with respect to a single objective, despite the fact that many real-world problems are inherently multi-objective in nature. Research into multi-objective MARL is still in its infancy, and few studies to date have dealt with the issue of credit assignment. Reward shaping has been proposed as a means to address the credit assignment problem in single-objective MARL; however, it has been shown to alter the intended goals of a domain if misused, leading to unintended behaviour. Two popular shaping methods are potential-based reward shaping and difference rewards, and both have been repeatedly shown to improve learning speed and the quality of joint policies learned by agents in single-objective MARL domains. This work discusses the theoretical implications of applying these shaping approaches to cooperative multi-objective MARL problems, and evaluates their efficacy using two benchmark domains. Our results constitute the first empirical evidence that agents using these shaping methodologies can sample true Pareto optimal solutions in cooperative multi-objective stochastic games.

KER Journal 2017 Journal Article

Multi-agent credit assignment in stochastic resource management games

  • Patrick Mannion
  • Sam Devlin
  • Jim Duggan
  • Enda Howley

Multi-agent systems (MASs) are a form of distributed intelligence, where multiple autonomous agents act in a common environment. Numerous complex, real world systems have been successfully optimized using multi-agent reinforcement learning (MARL) in conjunction with the MAS framework. In MARL agents learn by maximizing a scalar reward signal from the environment, and thus the design of the reward function directly affects the policies learned. In this work, we address the issue of appropriate multi-agent credit assignment in stochastic resource management games. We propose two new stochastic games to serve as testbeds for MARL research into resource management problems: the tragic commons domain and the shepherd problem domain. Our empirical work evaluates the performance of two commonly used reward shaping techniques: potential-based reward shaping and difference rewards. Experimental results demonstrate that systems using appropriate reward shaping techniques for multi-agent credit assignment can achieve near-optimal performance in stochastic resource management games, outperforming systems learning using unshaped local or global evaluations. We also present the first empirical investigations into the effect of expressing the same heuristic knowledge in state- or action-based formats, therefore developing insights into the design of multi-agent potential functions that will inform future work.

KER Journal 2016 Journal Article

Context-sensitive reward shaping for sparse interaction multi-agent systems

  • Yann-Michaël de Hauwere
  • Sam Devlin
  • Daniel Kudenko
  • Ann Nowé

Potential-based reward shaping is a commonly used approach in reinforcement learning to direct exploration based on prior knowledge. Both in single and multi-agent settings this technique speeds up learning without losing any theoretical convergence guarantees. However, if speed-ups through reward shaping are to be achieved in multi-agent environments, a different shaping signal should be used for each context in which agents have a different subgoal or when agents are involved in a different interaction situation. This paper describes the use of context-aware potential functions in a multi-agent system in which the interactions between agents are sparse. This means that, unknown to the agents a priori, the interactions between the agents only occur sporadically in certain regions of the state space. During these interactions, agents need to coordinate in order to reach the global optimal solution. We demonstrate how different reward shaping functions can be used on top of Future Coordinating Q-learning (FCQ-learning), an algorithm capable of automatically detecting when agents should take each other into consideration. Using FCQ-learning, coordination problems can even be anticipated before the actual problems occur, allowing them to be solved in a timely manner. We evaluate our approach on a range of gridworld problems, as well as a simulation of air traffic control.

AAMAS Conference 2016 Conference Paper

Multi-Objective Dynamic Dispatch Optimisation Using Multi-Agent Reinforcement Learning (Extended Abstract)

  • Patrick Mannion
  • Karl Mason
  • Sam Devlin
  • Jim Duggan
  • Enda Howley

In this paper, we examine the application of Multi-Agent Reinforcement Learning (MARL) to a Dynamic Economic Emissions Dispatch problem. This is a multi-objective problem domain, where the conflicting objectives of fuel cost and emissions must be minimised. We evaluate the performance of several different MARL credit assignment structures in this domain, and our experimental results show that MARL can produce comparable solutions to those computed by Genetic Algorithms and Particle Swarm Optimisation.

KER Journal 2016 Journal Article

Overcoming incorrect knowledge in plan-based reward shaping

  • Kyriakos Efthymiadis
  • Sam Devlin
  • Daniel Kudenko

Reward shaping has been shown to significantly improve an agent's performance in reinforcement learning. Plan-based reward shaping is a successful approach in which a STRIPS plan is used in order to guide the agent to the optimal behaviour. However, if the provided knowledge is wrong, it has been shown the agent will take longer to learn the optimal policy. Previously, in some cases, it was better to ignore all prior knowledge despite it only being partially incorrect. This paper introduces a novel use of knowledge revision to overcome incorrect domain knowledge when provided to an agent receiving plan-based reward shaping. Empirical results show that an agent using this method can outperform the previous agent receiving plan-based reward shaping without knowledge revision.

KER Journal 2016 Journal Article

Plan-based reward shaping for multi-agent reinforcement learning

  • Sam Devlin
  • Daniel Kudenko

Recent theoretical results have justified the use of potential-based reward shaping as a way to improve the performance of multi-agent reinforcement learning (MARL). However, the question remains of how to generate a useful potential function. Previous research demonstrated the use of STRIPS operator knowledge to automatically generate a potential function for single-agent reinforcement learning. Following up on this work, we investigate the use of STRIPS planning knowledge in the context of MARL. Our results show that a potential function based on joint or individual plan knowledge can significantly improve MARL performance compared with no shaping. In addition, we investigate the limitations of individual plan knowledge as a source of reward shaping in cases where the combination of individual agent plans causes conflict.
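A common reduction of plan-based shaping is to set a state's potential to its progress along the STRIPS plan, so shaping reward pays out as plan steps are achieved. A toy sketch of that construction (encoding plan steps as predicates over states is an illustrative assumption):

```python
def plan_potential(state, plan):
    """Potential = number of consecutive plan steps already satisfied."""
    steps_done = 0
    for step_holds in plan:
        if step_holds(state):
            steps_done += 1
        else:
            break
    return steps_done

def shaping_reward(state, next_state, plan, gamma=0.99):
    # Potential-based shaping term F(s, s') = gamma * phi(s') - phi(s).
    return gamma * plan_potential(next_state, plan) - plan_potential(state, plan)

# Toy plan for a 1-d gridworld: first reach x >= 2, then x >= 4.
plan = [lambda s: s >= 2, lambda s: s >= 4]
print(shaping_reward(1, 2, plan))   # positive: one plan step achieved
```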

AAMAS Conference 2016 Conference Paper

Resource Abstraction for Reinforcement Learning in Multiagent Congestion Problems

  • Kleanthis Malialis
  • Sam Devlin
  • Daniel Kudenko

Real-world congestion problems (e.g. traffic congestion) are typically very complex and large-scale. Multiagent reinforcement learning (MARL) is a promising candidate for dealing with this emerging complexity by providing an autonomous and distributed solution to these problems. However, there are three limiting factors that affect the deployability of MARL approaches to congestion problems. These are learning time, scalability and decentralised coordination, i.e. no communication between the learning agents. In this paper we introduce Resource Abstraction, an approach that addresses these challenges by allocating the available resources into abstract groups. This abstraction creates new reward functions that provide a more informative signal to the learning agents and aid the coordination amongst them. Experimental work is conducted on two benchmark domains from the literature, an abstract congestion problem and a realistic traffic congestion problem. The current state-of-the-art for solving multiagent congestion problems is a form of reward shaping called difference rewards. We show that the system using Resource Abstraction significantly improves the learning speed and scalability, and achieves the highest possible or near-highest joint performance/social welfare for both congestion problems in large-scale scenarios involving up to 1000 reinforcement learning agents.

AAAI Conference 2015 Conference Paper

Expressing Arbitrary Reward Functions as Potential-Based Advice

  • Anna Harutyunyan
  • Sam Devlin
  • Peter Vrancx
  • Ann Nowé

Effectively incorporating external advice is an important problem in reinforcement learning, especially as it moves into the real world. Potential-based reward shaping is a way to provide the agent with a specific form of additional reward, with the guarantee of policy invariance. In this work we give a novel way to incorporate an arbitrary reward function with the same guarantee, by implicitly translating it into the specific form of dynamic advice potentials, which are maintained as an auxiliary value function learnt at the same time. We show that advice provided in this way captures the input reward function in expectation, and demonstrate its efficacy empirically.
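As we read the construction, the arbitrary advice reward is folded into dynamic, state-action potentials by learning an auxiliary value function on its negation (this summary is reconstructed from the abstract; see the paper for the precise statement):

```latex
% Learn an auxiliary value function \Phi on the negated advice reward:
r^{\Phi}(s, a) = -R^{\dagger}(s, a)

% Shape with the dynamic state-action form, so the shaped return
% reflects R + R^{\dagger} in expectation:
F_t(s, a, s', a') = \gamma\, \Phi_{t+1}(s', a') - \Phi_t(s, a)
```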

ECAI Conference 2014 Conference Paper

Coordinated Team Learning and Difference Rewards for Distributed Intrusion Response

  • Kleanthis Malialis
  • Sam Devlin
  • Daniel Kudenko

Distributed denial of service attacks constitute a rapidly evolving threat in the current Internet. Multiagent Router Throttling is a novel approach to respond to such attacks. We demonstrate that our approach can significantly scale up using hierarchical communication and coordinated team learning. Furthermore, we incorporate a form of reward shaping called difference rewards and show that the scalability of our system is significantly improved in experiments involving over 100 reinforcement learning agents. We also demonstrate that difference rewards constitute an ideal online learning mechanism for network intrusion response. We compare our proposed approach against a popular state-of-the-art router throttling technique from the network security literature, and we show that our proposed approach significantly outperforms it. We note that our approach can be useful in other related multiagent domains.

AAMAS Conference 2013 Conference Paper

Overcoming Erroneous Domain Knowledge in Plan-Based Reward Shaping

  • Kyriakos Efthymiadis
  • Sam Devlin
  • Daniel Kudenko

Reward shaping has been shown to significantly improve an agent’s performance in reinforcement learning. Plan-based reward shaping is a successful approach in which a STRIPS plan is used in order to guide the agent to the optimal behaviour. However, if the provided domain knowledge is wrong, it has been shown the agent will take longer to learn the optimal policy. Previously, in some cases, it was better to ignore all prior knowledge despite it only being partially erroneous. This paper introduces a novel use of knowledge revision to overcome erroneous domain knowledge when provided to an agent receiving plan-based reward shaping. Empirical results show that an agent using this method can outperform the previous agent receiving plan-based reward shaping without knowledge revision.

AAMAS Conference 2013 Conference Paper

Potential-Based Reward Shaping for POMDPs

  • Adam Eck
  • Leen-Kiat Soh
  • Sam Devlin
  • Daniel Kudenko

We address the problem of suboptimal behavior caused by short horizons during online POMDP planning. Our solution extends potential-based reward shaping from the related field of reinforcement learning to online POMDP planning in order to improve planning without increasing the planning horizon. In our extension, information about the quality of belief states is added to the function optimized by the agent during planning. This information provides hints of where the agent might find high future rewards, and thus achieve greater cumulative rewards.
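Concretely, the planner's node values gain a potential-based term that rewards moving toward better beliefs. One plausible instantiation, with negative belief entropy as the quality potential (the paper's exact potential and planner integration may differ):

```python
import numpy as np

def belief_potential(belief):
    """Negative entropy: more certain beliefs score higher."""
    b = np.asarray(belief, dtype=float)
    b = b[b > 0]                           # avoid log(0)
    return float(np.sum(b * np.log(b)))

def shaped_value(value, belief, next_belief, gamma=0.95):
    # Add F(b, b') = gamma * phi(b') - phi(b) to the planning value.
    return value + gamma * belief_potential(next_belief) - belief_potential(belief)

print(shaped_value(1.0, [0.5, 0.5], [0.9, 0.1]))
```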

AAMAS Conference 2012 Conference Paper

Dynamic Potential-Based Reward Shaping

  • Sam Devlin
  • Daniel Kudenko

Potential-based reward shaping can significantly improve the time needed to learn an optimal policy and, in multi-agent systems, the performance of the final joint-policy. It has been proven not to alter the optimal policy of an agent learning alone or the Nash equilibria of multiple agents learning together. However, a limitation of existing proofs is the assumption that the potential of a state does not change dynamically during learning. This assumption is often broken, especially if the reward-shaping function is generated automatically. In this paper we prove and demonstrate a method of extending potential-based reward shaping to allow dynamic shaping and maintain the guarantees of policy invariance in the single-agent case and consistent Nash equilibria in the multi-agent case.
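The resulting shaping function simply time-indexes the potential, with t and t' the times at which s and s' are visited; under this form the single-agent and multi-agent guarantees carry over:

```latex
% Dynamic potential-based shaping: \Phi may change during learning.
F(s, t, s', t') = \gamma\, \Phi(s', t') - \Phi(s, t)
```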

AAMAS Conference 2011 Conference Paper

Multi-Agent, Reward Shaping for RoboCup KeepAway

  • Sam Devlin
  • Marek Grześ
  • Daniel Kudenko

This paper investigates the impact of reward shaping in multi-agent reinforcement learning as a way to incorporate domain knowledge about good strategies. In theory, potential-based reward shaping does not alter the Nash Equilibria of a stochastic game, only the exploration of the shaped agent. We demonstrate empirically the performance of state-based and state-action-based reward shaping in RoboCup KeepAway. The results illustrate that reward shaping can alter both the learning time required to reach a stable joint policy and the final group performance for better or worse.

AAMAS Conference 2011 Conference Paper

Theoretical Considerations of Potential-Based Reward Shaping for Multi-Agent Systems

  • Sam Devlin
  • Daniel Kudenko

Potential-based reward shaping has previously been proven to both be equivalent to Q-table initialisation and guarantee policy invariance in single-agent reinforcement learning. The method has since been used in multi-agent reinforcement learning without consideration of whether the theoretical equivalence and guarantees hold. This paper extends the existing proofs to similar results in multi-agent systems, providing the theoretical background to explain the success of previous empirical studies. Specifically, it is proven that the equivalence to Q-table initialisation remains and the Nash Equilibria of the underlying stochastic game are not modified. Furthermore, we demonstrate empirically that potential-based reward shaping affects exploration and, consequently, can alter the joint policy converged upon.
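For reference, the initialisation equivalence extended here says that shaping an agent with potential Phi behaves like starting its Q-table from that potential:

```latex
% Shaping with potential \Phi is equivalent to the initialisation
Q_0(s, a) = \Phi(s) \quad \text{for all actions } a
```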