Arrow Research

Author name cluster

Thiago D. Simão

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

26 papers
2 author rows

Possible papers (26)

EWRL Workshop 2025 Workshop Paper

Missingness-MDPs: Bridging the Theory of Missing Data and POMDPs

  • Joshua Wendland
  • Markel Zubia
  • Roman Andriushchenko
  • Maris F. L. Galesloot
  • Milan Ceska
  • Henrik von Kleist
  • Thiago D. Simão
  • Maximilian Weininger

We introduce *missingness-MDPs* (miss-MDPs), a subclass of partially observable Markov decision processes (POMDPs) that incorporates the theory of missing data. Miss-MDPs capture settings where, at each step, the current state may go partially missing, that is, the state is not observed. Missingness of observations occurs dynamically and is caused by a *missingness function*, which governs the underlying probabilistic missingness process. Miss-MDPs distinguish three types of missingness processes via restrictions on the missingness function: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). Our goal is to compute a policy for a miss-MDP with an *unknown missingness function*. We propose algorithms that, by using a retrospective dataset and based on the different types of missingness processes, approximate the missingness function and, thereby, the true miss-MDP. The algorithms can approximate a subset of MAR and MNAR missingness functions, and we show that, for these, the optimal policy in the approximated model is $\varepsilon$-optimal in the true miss-MDP. The empirical evaluation confirms these findings. Additionally, it shows that our approach becomes more sample-efficient when exploiting the type of the underlying missingness process.
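
The MCAR/MAR/MNAR taxonomy from missing-data theory, which the abstract builds on, can be made concrete with a small sketch. The snippet below is a hedged illustration with hypothetical names, not the paper's formalism: a two-component state (x, y) where x is always observed and y may go missing, and the three mechanisms differ only in what the missingness probability is allowed to depend on.

```python
# Illustrative sketch of the three missingness mechanisms (hypothetical names).
import random

def mcar(x, p=0.3):
    # MCAR: the chance of missingness is a constant, independent of the state.
    return p

def mar(x):
    # MAR: the chance of missingness may depend only on the part that stays observed.
    return 0.5 if x > 0 else 0.1

def mnar(y):
    # MNAR: the chance of missingness may depend on the very value that goes missing.
    return 0.5 if y > 0 else 0.1

def observe(state):
    """state = (x, y); x is always observed, y may go missing under each mechanism."""
    x, y = state
    return {
        "MCAR": (x, None) if random.random() < mcar(x) else (x, y),
        "MAR":  (x, None) if random.random() < mar(x) else (x, y),
        "MNAR": (x, None) if random.random() < mnar(y) else (x, y),
    }

print(observe((1.0, -2.0)))
```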

EWRL Workshop 2025 Workshop Paper

On Evaluating Policies for Robust POMDPs

  • Merlijn Krale
  • Eline M. Bovy
  • Maris F. L. Galesloot
  • Thiago D. Simão
  • Nils Jansen

Robust partially observable Markov decision processes (RPOMDPs) model partially observable sequential decision-making problems where an agent must be $\textit{robust}$ against a range of dynamics. RPOMDPs can be viewed as two-player games between an agent, which selects actions, and $\textit{nature}$, which adversarially selects dynamics. Evaluating an agent policy requires finding an adversarial nature policy, which is computationally challenging. In this paper, we advance the evaluation of agent policies for RPOMDPs in three ways. First, we discuss suitable benchmarks. We observe that for some RPOMDPs, an optimal agent policy can be found by considering only subsets of nature policies, making them easier to solve. We formalize this concept of $\textit{solvability}$ and construct three benchmarks that are only solvable for expressive sets of nature policies. Second, we describe a provably sound method to evaluate agent policies for RPOMDPs by solving an equivalent MDP. Third, we lift two well-known POMDP upper value bounds to RPOMDPs, which can be used to efficiently approximate the optimality gap of a policy and serve as baselines. Our experimental evaluation shows that (1) our proposed benchmarks cannot be solved by assuming naive nature policies, (2) our method of evaluating policies is accurate, and (3) the approximations provide solid baselines for evaluation.

ECAI Conference 2025 Conference Paper

Pessimistic Iterative Planning with RNNs for Robust POMDPs

  • Maris F. L. Galesloot
  • Marnix Suilen
  • Thiago D. Simão
  • Steven Carr 0002
  • Matthijs T. J. Spaan
  • Ufuk Topcu
  • Nils Jansen 0001

Robust POMDPs extend classical POMDPs to incorporate model uncertainty using so-called uncertainty sets on the transition and observation functions, effectively defining ranges of probabilities. Policies for robust POMDPs must be (1) memory-based to account for partial observability and (2) robust against model uncertainty to account for the worst-case probability instances from the uncertainty sets. To compute such robust memory-based policies, we propose the pessimistic iterative planning (PIP) framework, which alternates between (1) selecting pessimistic POMDPs via worst-case probability instances from the uncertainty sets, and (2) computing finite-state controllers (FSCs) for these pessimistic POMDPs. Within PIP, we propose the RFSCNET algorithm, which optimizes a recurrent neural network to compute the FSCs. The empirical evaluation shows that RFSCNET can compute better-performing robust policies than several baselines and a state-of-the-art robust POMDP solver.

ICLR Conference 2025 Conference Paper

Robust Transfer of Safety-Constrained Reinforcement Learning Agents

  • Markel Zubia
  • Thiago D. Simão
  • Nils Jansen 0001

Reinforcement learning (RL) often relies on trial and error, which may cause undesirable outcomes. As a result, standard RL is inappropriate for safety-critical applications. To address this issue, one may train a safe agent in a controlled environment (where safety violations are allowed) and then transfer it to the real world (where safety violations may have disastrous consequences). Prior work has made this transfer safe as long as the new environment preserves the safety-related dynamics. However, in most practical applications, differences or shifts in dynamics between the two environments are inevitable, potentially leading to safety violations after the transfer. This work aims to guarantee safety even when the new environment has different (safety-related) dynamics. In other words, we aim to make the process of safe transfer robust. Our methodology (1) robustifies an agent in the controlled environment and (2) provably provides, under mild assumptions, a safe transfer to new environments. The empirical evaluation shows that this method yields policies that are robust against changes in dynamics, demonstrating safety after transfer to a new environment.

ICLR Conference 2025 Conference Paper

Safety-Prioritizing Curricula for Constrained Reinforcement Learning

  • Cevahir Köprülü
  • Thiago D. Simão
  • Nils Jansen 0001
  • Ufuk Topcu

Curriculum learning aims to accelerate reinforcement learning (RL) by generating curricula, i.e., sequences of tasks of increasing difficulty. Although existing curriculum generation approaches provide benefits in sample efficiency, they overlook safety-critical settings where an RL agent must adhere to safety constraints. Thus, these approaches may generate tasks that cause RL agents to violate safety constraints during training and behave suboptimally afterwards. We develop a safe curriculum generation approach (SCG) that aligns the objectives of constrained RL and curriculum learning: improving safety during training and boosting sample efficiency. SCG generates sequences of tasks in which the RL agent can be safe and performant by initially preferring tasks with minimal safety violations over high-reward ones. We empirically show that, compared to state-of-the-art curriculum learning approaches and their naively modified safe versions, SCG achieves optimal performance and the fewest constraint violations during training.

JAIR Journal 2025 Journal Article

Scaling Safe Policy Improvement: Monte Carlo Tree Search and Policy Iteration Strategies

  • Federico Bianchi
  • Alberto Castellini
  • Edoardo Zorzi
  • Thiago D. Simão
  • Matthijs T. J. Spaan
  • Alessandro Farinelli

Offline Reinforcement Learning (RL) allows policies to be trained on pre-collected datasets without requiring additional interactions with the environment. This approach bypasses the need for real-time data acquisition in real-world applications, which can be impractical due to the safety issues inherent in the learning process. However, offline RL faces significant challenges, such as distributional shifts and extrapolation errors, and the resulting policies might underperform compared to the baseline policy. Safe policy improvement algorithms mitigate these issues, enabling the reliable deployment of RL approaches in real-world scenarios where historical data is available, guaranteeing that any policy changes will not result in worse performance compared to the baseline policy used to collect training data. In this paper, we propose MCTS-SPIBB, an algorithm that leverages Monte Carlo Tree Search (MCTS) for scaling safe policy improvement to large domains. We theoretically prove that the policy generated by MCTS-SPIBB converges to the optimal safely improved policy produced by Safe Policy Improvement with Baseline Bootstrapping (SPIBB) as the number of simulations increases. Additionally, we introduce SDP-SPIBB, a novel extension of SPIBB designed to address the scalability limitations of the standard algorithm via Scalable Dynamic Programming. Our empirical analysis across four benchmark domains demonstrates that MCTS-SPIBB and SDP-SPIBB significantly enhance the scalability of safe policy improvement, providing robust and efficient algorithms for large-scale applications. These contributions represent a significant step towards the deployment of safe RL algorithms in complex real-world environments.
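
For context, SPIBB's central "baseline bootstrapping" idea, which both MCTS-SPIBB and SDP-SPIBB build on, can be sketched in a few lines. The snippet below is a hedged, generic tabular illustration with hypothetical names (`Q` estimates, a baseline policy, state-action counts, and a count threshold `n_wedge`), not the algorithms proposed in the paper:

```python
# Generic sketch of the SPIBB bootstrapping constraint (hypothetical names).
import numpy as np

def spibb_policy(Q, baseline, counts, n_wedge):
    """Q, baseline, counts: arrays of shape (S, A); n_wedge: count threshold."""
    S, A = Q.shape
    pi = np.zeros((S, A))
    for s in range(S):
        bootstrapped = counts[s] < n_wedge          # too little data: keep the baseline
        pi[s, bootstrapped] = baseline[s, bootstrapped]
        free_mass = 1.0 - pi[s].sum()               # baseline mass of well-sampled actions
        if (~bootstrapped).any():
            # put the remaining mass on the best sufficiently-sampled action
            best = np.flatnonzero(~bootstrapped)[np.argmax(Q[s, ~bootstrapped])]
            pi[s, best] += free_mass
        else:
            pi[s] = baseline[s]                     # no action has enough data
    return pi
```

Pairs with fewer than `n_wedge` samples keep the baseline's probabilities, so the improved policy can only deviate from the baseline where the data supports it.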

AAMAS Conference 2025 Conference Paper

Tighter Value-Function Approximations for POMDPs

  • Merlijn Krale
  • Wietze Koops
  • Sebastian Junges
  • Thiago D. Simão
  • Nils Jansen

Solving partially observable Markov decision processes (POMDPs) typically requires reasoning about the values of exponentially many state beliefs. Towards practical performance, state-of-the-art solvers use value bounds to guide this reasoning. However, sound upper value bounds are often expensive to compute, and there is a tradeoff between the tightness of such bounds and their computational cost. This paper introduces new and provably tighter upper value bounds than the commonly used fast informed bound. Our empirical evaluation shows that, despite their additional computational overhead, the new upper bounds accelerate state-of-the-art POMDP solvers on a wide range of benchmarks.
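
For reference, the fast informed bound mentioned above is a standard construction; a minimal tabular sketch (the generic textbook version, not the paper's tighter bounds) looks as follows, assuming arrays `T[s, a, s']`, `O[a, s', o]`, and `R[s, a]`:

```python
# Generic fast informed bound (FIB) for a tabular POMDP.
import numpy as np

def fib_alpha_vectors(T, O, R, gamma, iters=200):
    S, A, _ = T.shape
    alpha = np.zeros((A, S))                          # one alpha-vector per action
    for _ in range(iters):
        new = np.empty_like(alpha)
        for a in range(A):
            acc = np.zeros(S)
            for o in range(O.shape[2]):
                # value of receiving observation o, then following the best vector
                cand = alpha @ (T[:, a, :] * O[a, :, o]).T   # shape (A, S)
                acc += cand.max(axis=0)
            new[a] = R[:, a] + gamma * acc
        alpha = new
    return alpha

def fib_upper_bound(belief, alpha):
    # Upper bound on the optimal value at a belief: best action's alpha-vector.
    return float((alpha @ belief).max())
```

Each action keeps a single alpha-vector, and `fib_upper_bound` evaluates the bound at a belief by taking the best action's vector; tighter bounds of the kind the paper proposes shrink this value closer to the true optimum.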

AAAI Conference 2024 Conference Paper

Factored Online Planning in Many-Agent POMDPs

  • Maris F. L. Galesloot
  • Thiago D. Simão
  • Sebastian Junges
  • Nils Jansen

In centralized multi-agent systems, often modeled as multi-agent partially observable Markov decision processes (MPOMDPs), the action and observation spaces grow exponentially with the number of agents, making the value and belief estimation of single-agent online planning ineffective. Prior work partially tackles value estimation by exploiting the inherent structure of multi-agent settings via so-called coordination graphs. Additionally, belief estimation methods have been improved by incorporating the likelihood of observations into the approximation. However, the challenges of value estimation and belief estimation have only been tackled individually, which prevents existing methods from scaling to settings with many agents. Therefore, we address these challenges simultaneously. First, we introduce weighted particle filtering to a sample-based online planner for MPOMDPs. Second, we present a scalable approximation of the belief. Third, we bring an approach that exploits the typical locality of agent interactions into novel online planning algorithms for MPOMDPs, which operate on a so-called sparse particle filter tree. Our experimental evaluation against several state-of-the-art baselines shows that our methods (1) are competitive in settings with only a few agents and (2) improve over the baselines in the presence of many agents.
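
As background, the kind of weighted particle filtering the abstract refers to follows the usual importance-weighting pattern. The sketch below is a generic sequential importance resampling update with hypothetical `step` and `obs_likelihood` helpers, not the paper's method:

```python
# Generic weighted particle-filter belief update (hypothetical helpers).
import numpy as np

def update_belief(particles, weights, action, observation, step, obs_likelihood, rng):
    """particles: list of states; step(s, a, rng) samples a successor state;
    obs_likelihood(o, s_next, a) gives P(o | s_next, a)."""
    next_particles = [step(s, action, rng) for s in particles]
    # Reweight each particle by how well it explains the received observation.
    w = np.array([wi * obs_likelihood(observation, s, action)
                  for wi, s in zip(weights, next_particles)])
    if w.sum() == 0:                      # particle deprivation: fall back to uniform
        w = np.ones(len(next_particles))
    w /= w.sum()
    # Resample to avoid weight degeneracy.
    idx = rng.choice(len(next_particles), size=len(next_particles), p=w)
    n = len(next_particles)
    return [next_particles[i] for i in idx], np.full(n, 1.0 / n)
```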

EWRL Workshop 2024 Workshop Paper

Pessimistic Iterative Planning for Robust POMDPs

  • Maris F. L. Galesloot
  • Marnix Suilen
  • Thiago D. Simão
  • Steven Carr
  • Matthijs T. J. Spaan
  • Ufuk Topcu
  • Nils Jansen

Robust partially observable Markov decision processes (robust POMDPs) extend classical POMDPs to handle additional uncertainty on the transition and observation probabilities via so-called uncertainty sets. Policies for robust POMDPs must not only be memory-based to account for partial observability but also robust against model uncertainty to account for the worst-case instances from the uncertainty sets. We propose the pessimistic iterative planning (PIP) framework, which finds robust memory-based policies for robust POMDPs. PIP alternates between two main steps: (1) selecting an adversarial (non-robust) POMDP via worst-case probability instances from the uncertainty sets; and (2) computing a finite-state controller (FSC) for this adversarial POMDP. We evaluate the performance of this FSC on the original robust POMDP and use this evaluation in step (1) to select the next adversarial POMDP. Within PIP, we propose the rFSCNet algorithm. In each iteration, rFSCNet finds an FSC through a recurrent neural network by using supervision policies optimized for the adversarial POMDP. The empirical evaluation in four benchmark environments showcases improved robustness against several baseline methods and competitive performance compared to a state-of-the-art robust POMDP solver.

AAAI Conference 2024 Conference Paper

Robust Active Measuring under Model Uncertainty

  • Merlijn Krale
  • Thiago D. Simão
  • Jana Tumova
  • Nils Jansen

Partial observability and uncertainty are common problems in sequential decision-making that particularly impede the use of formal models such as Markov decision processes (MDPs). However, in practice, agents may be able to employ costly sensors to measure their environment and resolve partial observability by gathering information. Moreover, imprecise transition functions can capture model uncertainty. We combine these concepts and extend MDPs to robust active-measuring MDPs (RAM-MDPs). We present an active-measure heuristic to solve RAM-MDPs efficiently and show that model uncertainty can, counterintuitively, let agents take fewer measurements. We propose a method to counteract this behavior while only incurring a bounded additional cost. We empirically compare our methods to several baselines and show their superior scalability and performance.

ICML Conference 2024 Conference Paper

Scalable Safe Policy Improvement for Factored Multi-Agent MDPs

  • Federico Bianchi 0002
  • Edoardo Zorzi
  • Alberto Castellini
  • Thiago D. Simão
  • Matthijs T. J. Spaan
  • Alessandro Farinelli

In this work, we focus on safe policy improvement in multi-agent domains where current state-of-the-art methods cannot be effectively applied because of large state and action spaces. We consider recent results using Monte Carlo Tree Search for Safe Policy Improvement with Baseline Bootstrapping and propose a novel algorithm that scales this approach to multi-agent domains, exploiting the factorization of the transition model and value function. Given a centralized behavior policy and a dataset of trajectories, our algorithm generates an improved policy by selecting joint actions using a novel extension of Max-Plus (or Variable Elimination) that constrains local actions to guarantee safety criteria. An empirical evaluation on multi-agent SysAdmin and multi-UAV Delivery shows that the approach scales to very large domains where state-of-the-art methods cannot work.

ICAPS Conference 2023 Conference Paper

Act-Then-Measure: Reinforcement Learning for Partially Observable Environments with Active Measuring

  • Merlijn Krale
  • Thiago D. Simão
  • Nils Jansen 0001

We study Markov decision processes (MDPs), where agents control when and how they gather information, as formalized by action-contingent noiselessly observable MDPs (ACNO-MDPs). In these models, actions have two components: a control action that influences how the environment changes and a measurement action that determines what the agent observes about the current state.
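
A minimal sketch of that action structure (a hypothetical interface, not the paper's implementation): a wrapper that takes a control action plus a boolean measurement decision, charges a fixed measuring cost, and hides the state when the agent chooses not to measure.

```python
# Hypothetical wrapper illustrating paired control and measurement actions.
class MeasureWrapper:
    def __init__(self, env, measure_cost=0.1):
        self.env = env                      # any env with step(control) -> (state, reward, done)
        self.measure_cost = measure_cost

    def step(self, control, measure):
        state, reward, done = self.env.step(control)
        if measure:
            # Measuring reveals the current state but costs a fixed penalty.
            return state, reward - self.measure_cost, done
        return None, reward, done           # state stays hidden when not measured
```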

IJCAI Conference 2023 Conference Paper

More for Less: Safe Policy Improvement with Stronger Performance Guarantees

  • Patrick Wienhöft
  • Marnix Suilen
  • Thiago D. Simão
  • Clemens Dubslaff
  • Christel Baier
  • Nils Jansen

In an offline reinforcement learning setting, the safe policy improvement (SPI) problem aims to improve upon the behavior policy from which the sample data has been generated. State-of-the-art approaches to SPI require a high number of samples to provide practical probabilistic guarantees on the improved policy's performance. We present a novel approach to the SPI problem that requires less data for such guarantees. Specifically, to prove the correctness of these guarantees, we devise implicit transformations on the data set and the underlying environment model that serve as theoretical foundations to derive tighter improvement bounds for SPI. Our empirical evaluation, using the well-established SPI with baseline bootstrapping (SPIBB) algorithm on standard benchmarks, shows that our method indeed significantly reduces the sample complexity of the SPIBB algorithm.

IJCAI Conference 2023 Conference Paper

Recursive Small-Step Multi-Agent A* for Dec-POMDPs

  • Wietze Koops
  • Nils Jansen
  • Sebastian Junges
  • Thiago D. Simão

We present recursive small-step multi-agent A* (RS-MAA*), an exact algorithm that optimizes the expected reward in decentralized partially observable Markov decision processes (Dec-POMDPs). RS-MAA* builds on multi-agent A* (MAA*), an algorithm that finds policies by exploring a search tree, but tackles two major scalability concerns. First, we employ a modified, small-step variant of the search tree that avoids the double exponential outdegree of the classical formulation. Second, we use a tight and recursive heuristic that we compute on-the-fly, thereby avoiding an expensive precomputation. The resulting algorithm is conceptually simple, yet it shows superior performance on a rich set of standard benchmarks.

ECAI Conference 2023 Conference Paper

Reinforcement Learning by Guided Safe Exploration

  • Qisong Yang
  • Thiago D. Simão
  • Nils Jansen 0001
  • Simon H. Tindemans
  • Matthijs T. J. Spaan

Safety is critical to broadening the application of reinforcement learning (RL). Often, we train RL agents in a controlled environment, such as a laboratory, before deploying them in the real world. However, the real-world target task might be unknown prior to deployment. Reward-free RL trains an agent without the reward to adapt quickly once the reward is revealed. We consider the constrained reward-free setting, where an agent (the guide) learns to explore safely without the reward signal. This agent is trained in a controlled environment, which allows unsafe interactions and still provides the safety signal. After the target task is revealed, safety violations are not allowed anymore. Thus, the guide is leveraged to compose a safe behaviour policy. Drawing from transfer learning, we also regularize a target policy (the student) towards the guide while the student is unreliable and gradually eliminate the influence of the guide as training progresses. The empirical analysis shows that this method can achieve safe transfer learning and helps the student solve the target task faster.
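
One common way to realize this kind of guide regularization (a hedged sketch of a standard objective, not necessarily the paper's exact formulation) is to penalize divergence from the guide with a coefficient that decays over training:

$$\max_{\theta}\; \mathbb{E}\big[\mathrm{return}(\pi_\theta)\big] \;-\; \lambda_t\, \mathbb{E}_{s}\Big[\mathrm{KL}\big(\pi_{\mathrm{guide}}(\cdot \mid s)\,\big\|\,\pi_\theta(\cdot \mid s)\big)\Big], \qquad \lambda_t \to 0,$$

so the student stays close to the safe guide while it is unreliable and gradually takes over as $\lambda_t$ decays.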

UAI Conference 2023 Conference Paper

Risk-aware curriculum generation for heavy-tailed task distributions

  • Cevahir Köprülü
  • Thiago D. Simão
  • Nils Jansen 0001
  • Ufuk Topcu

Automated curriculum generation for reinforcement learning (RL) aims to speed up learning by designing a sequence of tasks of increasing difficulty. Such tasks are usually drawn from probability distributions with exponentially bounded tails, such as uniform or Gaussian distributions. However, existing approaches overlook heavy-tailed distributions. Under such distributions, current methods may fail to learn optimal policies in rare and risky tasks, which fall under the tails and yield the lowest returns, respectively. We address this challenge by proposing a risk-aware curriculum generation algorithm that simultaneously creates two curricula: 1) a primary curriculum that aims to maximize the expected discounted return with respect to a distribution over target tasks, and 2) an auxiliary curriculum that identifies and over-samples rare and risky tasks observed in the primary curriculum. Our empirical results show that the proposed algorithm achieves significantly higher returns in frequent as well as rare tasks compared to state-of-the-art methods.

AAAI Conference 2023 Conference Paper

Safe Policy Improvement for POMDPs via Finite-State Controllers

  • Thiago D. Simão
  • Marnix Suilen
  • Nils Jansen

We study safe policy improvement (SPI) for partially observable Markov decision processes (POMDPs). SPI is an offline reinforcement learning (RL) problem that assumes access to (1) historical data about an environment, and (2) the so-called behavior policy that previously generated this data by interacting with the environment. SPI methods neither require access to a model nor the environment itself, and aim to reliably improve upon the behavior policy in an offline manner. Existing methods make the strong assumption that the environment is fully observable. In our novel approach to the SPI problem for POMDPs, we assume that a finite-state controller (FSC) represents the behavior policy and that finite memory is sufficient to derive optimal policies. This assumption allows us to map the POMDP to a finite-state fully observable MDP, the history MDP. We estimate this MDP by combining the historical data and the memory of the FSC, and compute an improved policy using an off-the-shelf SPI algorithm. The underlying SPI method constrains the policy space according to the available data, such that the newly computed policy only differs from the behavior policy when sufficient data is available. We show that this new policy, converted into a new FSC for the (unknown) POMDP, outperforms the behavior policy with high probability. Experimental results on several well-established benchmarks show the applicability of the approach, even in cases where finite memory is not sufficient.

ICLR Conference 2023 Conference Paper

Safe Reinforcement Learning From Pixels Using a Stochastic Latent Representation

  • Yannick Hogewind
  • Thiago D. Simão
  • Tal Kachman
  • Nils Jansen 0001

We address the problem of safe reinforcement learning from pixel observations. Inherent challenges in such settings are (1) a trade-off between reward optimization and adhering to safety constraints, (2) partial observability, and (3) high-dimensional observations. We formalize the problem in a constrained, partially observable Markov decision process framework, where an agent obtains distinct reward and safety signals. To address the curse of dimensionality, we employ a novel safety critic using the stochastic latent actor-critic (SLAC) approach. The latent variable model predicts rewards and safety violations, and we use the safety critic to train safe policies. Using well-known benchmark environments, we demonstrate competitive performance over existing approaches regarding computational requirements, final reward return, and satisfying the safety constraints.

ICML Conference 2023 Conference Paper

Scalable Safe Policy Improvement via Monte Carlo Tree Search

  • Alberto Castellini
  • Federico Bianchi 0002
  • Edoardo Zorzi
  • Thiago D. Simão
  • Alessandro Farinelli
  • Matthijs T. J. Spaan

Algorithms for safely improving policies are important to deploy reinforcement learning approaches in real-world scenarios. In this work, we propose an algorithm, called MCTS-SPIBB, that computes safe policy improvement online using a Monte Carlo Tree Search-based strategy. We theoretically prove that the policy generated by MCTS-SPIBB converges, as the number of simulations grows, to the optimal safely improved policy generated by Safe Policy Improvement with Baseline Bootstrapping (SPIBB), a popular algorithm based on policy iteration. Moreover, our empirical analysis performed on three standard benchmark domains shows that MCTS-SPIBB scales to significantly larger problems than SPIBB because it computes the policy online and locally, i.e., only in the states actually visited by the agent.

NeurIPS Conference 2022 Conference Paper

Robust Anytime Learning of Markov Decision Processes

  • Marnix Suilen
  • Thiago D. Simão
  • David Parker
  • Nils Jansen

Markov decision processes (MDPs) are formal models commonly used in sequential decision-making. MDPs capture the stochasticity that may arise, for instance, from imprecise actuators via probabilities in the transition function. However, in data-driven applications, deriving precise probabilities from (limited) data introduces statistical errors that may lead to unexpected or undesirable outcomes. Uncertain MDPs (uMDPs) do not require precise probabilities but instead use so-called uncertainty sets in the transitions, accounting for such limited data. Tools from the formal verification community efficiently compute robust policies that provably adhere to formal specifications, like safety constraints, under the worst-case instance in the uncertainty set. We continuously learn the transition probabilities of an MDP in a robust anytime-learning approach that combines a dedicated Bayesian inference scheme with the computation of robust policies. In particular, our method (1) approximates probabilities as intervals, (2) adapts to new data that may be inconsistent with an intermediate model, and (3) may be stopped at any time to compute a robust policy on the uMDP that faithfully captures the data so far. Furthermore, our method is capable of adapting to changes in the environment. We show the effectiveness of our approach and compare it to robust policies computed on uMDPs learned by the UCRL2 reinforcement learning algorithm in an experimental evaluation on several benchmarks.
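
As background, the worst-case reasoning over uncertainty sets that robust policies rely on reduces, for interval uncertainty sets, to a simple inner optimization per state-action pair. The sketch below is the standard routine for intervals (assuming the lower bounds sum to at most 1 and the upper bounds to at least 1), not the paper's Bayesian learning method:

```python
# Generic worst-case backup over an interval uncertainty set.
import numpy as np

def worst_case_expectation(values, low, up):
    """Minimise sum_i p_i * values_i over probability vectors with low <= p <= up.
    Assumes low.sum() <= 1 <= up.sum()."""
    order = np.argsort(values)            # the adversary loads mass on low-value successors
    p = low.astype(float)
    budget = max(0.0, 1.0 - p.sum())      # probability mass still to distribute
    for i in order:
        extra = min(up[i] - low[i], budget)
        p[i] += extra
        budget -= extra
        if budget <= 0:
            break
    return float(p @ values)

def robust_q_backup(R_sa, low_sa, up_sa, V, gamma):
    # Worst-case Q-value for one state-action pair with interval transition bounds.
    return R_sa + gamma * worst_case_expectation(V, low_sa, up_sa)
```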

AAMAS Conference 2021 Conference Paper

AlwaysSafe: Reinforcement Learning without Safety Constraint Violations during Training

  • Thiago D. Simão
  • Nils Jansen
  • Matthijs T. J. Spaan

Deploying reinforcement learning (RL) involves major concerns around safety. Engineering a reward signal that allows the agent to maximize its performance while remaining safe is not trivial. Safe RL studies how to mitigate such problems. For instance, we can decouple safety from reward using constrained Markov decision processes (CMDPs), where an independent signal models the safety aspects. In this setting, an RL agent can autonomously find tradeoffs between performance and safety. Unfortunately, most RL agents designed for CMDPs only guarantee safety after the learning phase, which might prevent their direct deployment. In this work, we investigate settings where a concise abstract model of the safety aspects is given, a reasonable assumption since a thorough understanding of safety-related matters is a prerequisite for deploying RL in typical applications. Factored CMDPs provide such compact models when a small subset of features describe the dynamics relevant for the safety constraints. We propose an RL algorithm that uses this abstract model to learn policies for CMDPs safely, that is without violating the constraints. During the training process, this algorithm can seamlessly switch from a conservative policy to a greedy policy without violating the safety constraints. We prove that this algorithm is safe under the given assumptions. Empirically, we show that even if safety and reward signals are contradictory, this algorithm always operates safely and, when they are aligned, this approach also improves the agent’s performance.
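
To make the CMDP setting concrete, a small constrained MDP can be solved exactly with the standard occupancy-measure linear program. The sketch below is that generic construction (using SciPy), not the AlwaysSafe algorithm:

```python
# Generic occupancy-measure LP for a tabular constrained MDP.
import numpy as np
from scipy.optimize import linprog

def solve_cmdp(P, r, cost, mu0, gamma, budget):
    """P[s, a, s'] transitions, r[s, a] rewards, cost[s, a] safety costs,
    mu0 initial state distribution, budget on expected discounted cost."""
    S, A, _ = P.shape
    # Flow constraints:
    # sum_a rho(s', a) - gamma * sum_{s, a} P(s' | s, a) rho(s, a) = (1 - gamma) * mu0(s')
    A_eq = np.zeros((S, S * A))
    for s in range(S):
        for a in range(A):
            col = s * A + a
            A_eq[s, col] += 1.0
            A_eq[:, col] -= gamma * P[s, a, :]
    res = linprog(c=-r.flatten(),                        # maximize discounted reward
                  A_ub=cost.flatten()[None, :], b_ub=[budget],
                  A_eq=A_eq, b_eq=(1.0 - gamma) * mu0,
                  bounds=[(0, None)] * (S * A))
    rho = res.x.reshape(S, A)                            # discounted occupancy measure
    return rho / np.maximum(rho.sum(axis=1, keepdims=True), 1e-12)
```

The returned stochastic policy maximizes expected discounted reward while keeping the expected discounted safety cost within `budget`, which is the decoupled reward/safety trade-off the abstract describes.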

PRL Workshop 2021 Workshop Paper

AlwaysSafe: Reinforcement Learning without Safety Constraint Violations during Training (Extended Abstract)

  • Thiago D. Simão
  • Nils Jansen
  • Matthijs T. J. Spaan

Deploying reinforcement learning (RL) involves major concerns around safety. Engineering a reward signal that allows the agent to maximize its performance while remaining safe is not trivial. Safe RL studies how to mitigate such problems. For instance, we can decouple safety from reward using constrained Markov decision processes (CMDPs), where an independent signal models the safety aspects. In this setting, an RL agent can autonomously find tradeoffs between performance and safety. Unfortunately, most RL agents designed for CMDPs only guarantee safety after the learning phase, which might prevent their direct deployment. In this work, we investigate settings where a concise abstract model of the safety aspects is given, a reasonable assumption since a thorough understanding of safety-related matters is a prerequisite for deploying RL in typical applications. Factored CMDPs provide such compact models when a small subset of features describe the dynamics relevant for the safety constraints. We propose an RL algorithm that uses this abstract model to learn policies for CMDPs safely, that is without violating the constraints. During the training process, this algorithm can seamlessly switch from a conservative policy to a greedy policy without violating the safety constraints. We prove that this algorithm is safe under the given assumptions. Empirically, we show that even if safety and reward signals are contradictory, this algorithm always operates safely and, when they are aligned, this approach also improves the agent’s performance. Publication. This is an extended abstract of a paper published at AAMAS-21 (Simão, Jansen, and Spaan 2021).

AAAI Conference 2021 Conference Paper

WCSAC: Worst-Case Soft Actor Critic for Safety-Constrained Reinforcement Learning

  • Qisong Yang
  • Thiago D. Simão
  • Simon H. Tindemans
  • Matthijs T. J. Spaan

Safe exploration is regarded as a key priority area for reinforcement learning research. With separate reward and safety signals, it is natural to cast it as constrained reinforcement learning, where expected long-term costs of policies are constrained. However, it can be hazardous to set constraints on the expected safety signal without considering the tail of the distribution. For instance, in safety-critical domains, worst-case analysis is required to avoid disastrous results. We present a novel reinforcement learning algorithm called Worst-Case Soft Actor Critic, which extends the Soft Actor Critic algorithm with a safety critic to achieve risk control. More specifically, a certain level of conditional Value-at-Risk from the distribution is regarded as a safety measure to judge the constraint satisfaction, which guides the change of adaptive safety weights to achieve a trade-off between reward and safety. As a result, we can optimize policies under the premise that their worst-case performance satisfies the constraints. The empirical analysis shows that our algorithm attains better risk control compared to expectation-based methods.
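
For reference, conditional Value-at-Risk (CVaR) at level $\alpha$ is the expected cost in the worst $(1-\alpha)$ tail of the cost distribution. One common empirical estimator (a generic sketch, not the paper's distributional safety critic) is:

```python
# Generic empirical CVaR estimator over sampled episode costs.
import numpy as np

def cvar(cost_samples, alpha=0.9):
    """Average cost within the worst (1 - alpha) tail of the distribution."""
    costs = np.sort(np.asarray(cost_samples, dtype=float))
    var = np.quantile(costs, alpha)           # value-at-risk at level alpha
    return costs[costs >= var].mean()

# Constraining cvar(costs, 0.9) <= budget, rather than costs.mean() <= budget,
# guards against rare but disastrous episodes.
```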

IJCAI Conference 2019 Conference Paper

Safe and Sample-Efficient Reinforcement Learning Algorithms for Factored Environments

  • Thiago D. Simão

Reinforcement Learning (RL) deals with problems that can be modeled as a Markov Decision Process (MDP) where the transition function is unknown. In situations where an arbitrary policy $\pi$ is already in execution and the experiences with the environment were recorded in a batch $D$, an RL algorithm can use $D$ to compute a new policy $\pi'$. However, the policy computed by traditional RL algorithms might have worse performance compared to $\pi$. Our goal is to develop safe RL algorithms, where the agent has high confidence that the performance of $\pi'$ is better than the performance of $\pi$ given $D$. To develop sample-efficient and safe RL algorithms, we combine ideas from exploration strategies in RL with a safe policy improvement method.

AAAI Conference 2019 Conference Paper

Safe Policy Improvement with Baseline Bootstrapping in Factored Environments

  • Thiago D. Simão
  • Matthijs T. J. Spaan

We present a novel safe reinforcement learning algorithm that exploits the factored dynamics of the environment to become less conservative. We focus on problem settings in which a policy is already running and the interaction with the environment is limited. In order to safely deploy an updated policy, it is necessary to provide a confidence level regarding its expected performance. However, algorithms for safe policy improvement might require a large number of past experiences to become confident enough to change the agent’s behavior. Factored reinforcement learning, on the other hand, is known to make good use of the data provided. It can achieve a better sample complexity by exploiting independence between features of the environment, but it lacks a confidence level. We study how to improve the sample efficiency of the safe policy improvement with baseline bootstrapping algorithm by exploiting the factored structure of the environment. Our main result is a theoretical bound that is linear in the number of parameters of the factored representation instead of the number of states. The empirical analysis shows that our method can improve the policy using a number of samples potentially one order of magnitude smaller than the flat algorithm.

IJCAI Conference 2019 Conference Paper

Structure Learning for Safe Policy Improvement

  • Thiago D. Simão
  • Matthijs T. J. Spaan

We investigate how Safe Policy Improvement (SPI) algorithms can exploit the structure of factored Markov decision processes when such structure is unknown a priori. To facilitate the application of reinforcement learning in the real world, SPI provides probabilistic guarantees that policy changes in a running process will improve the performance of this process. However, current SPI algorithms have requirements that might be impractical, such as: (i) availability of a large amount of historical data, or (ii) prior knowledge of the underlying structure. To overcome these limitations we enhance a Factored SPI (FSPI) algorithm with different structure learning methods. The resulting algorithms need fewer samples to improve the policy and require weaker prior knowledge assumptions. In well-factorized domains, the proposed algorithms improve performance significantly compared to a flat SPI algorithm, demonstrating a sample complexity closer to an FSPI algorithm that knows the structure. This indicates that the combination of FSPI and structure learning algorithms is a promising solution to real-world problems involving many variables.