Arrow Research

Author name cluster

Guillermo Perez

Papers possibly associated with this exact author name in Arrow. This page groups case-insensitive exact name matches; it is not a full identity-disambiguation profile.

7 papers
1 author row

Possible papers (7)

PRL Workshop 2025 · Workshop Paper

Controller Synthesis from Deep Reinforcement Learning Policies

  • Florent Delgrange
  • Guy Avni
  • Anna Lukina
  • Christian Schilling
  • Ann Nowe
  • Guillermo Perez

We propose a novel framework for controller design in environments with a two-level structure: a known high-level graph (“map”) in which each vertex is populated by a Markov decision process, called a “room”. The framework “separates concerns” by using different design techniques for low- and high-level tasks. We apply reactive synthesis for high-level tasks: given a specification as a logical formula over the high-level graph and a collection of low-level policies given on “concise” latent structures, we construct a “planner” that selects which low-level policy to apply in each room. We develop a reinforcement learning procedure to train low-level policies on latent structures, which, unlike previous approaches, circumvents a model distillation step. It pairs the policy with probably approximately correct guarantees on its performance and abstraction quality, which are lifted to guarantees on the high-level task. These formal guarantees are the main advantage of the framework. Other advantages include scalability (rooms are large and their dynamics are unknown) and reusability of low-level policies. We demonstrate feasibility in challenging case studies involving agent navigation in environments with moving obstacles and visual inputs.
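
As a rough illustration of the separation of concerns, the Python sketch below shows how a synthesized planner could dispatch pre-trained low-level policies room by room. All names (RoomPlanner, env.reset, env.step) are hypothetical, not the paper's API.

    # Hypothetical sketch of the two-level control loop described above.
    from typing import Callable, Dict, Tuple

    Policy = Callable[[object], int]  # latent observation -> action

    class RoomPlanner:
        """Runs a plan over the high-level map: in each room (a vertex of
        the map), execute the low-level policy trained for the objective
        the plan requires there."""

        def __init__(self, plan: Dict[str, Tuple[str, str]],
                     policies: Dict[Tuple[str, str], Policy]):
            # plan: room -> (objective to pursue there, next room on success)
            self.plan = plan
            self.policies = policies

        def run_room(self, room: str, env) -> str:
            objective, next_room = self.plan[room]
            policy = self.policies[(room, objective)]
            obs, done = env.reset(room), False
            while not done:                        # low level: policy acts in the room MDP
                obs, done = env.step(policy(obs))
            return next_room                       # high level: planner takes over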

EWRL Workshop 2024 · Workshop Paper

Controller Synthesis from Deep Reinforcement Learning Policies

  • Florent Delgrange
  • Guy Avni
  • Anna Lukina
  • Christian Schilling
  • Ann Nowe
  • Guillermo Perez

We propose a novel framework for controller design in environments with a two-level structure: a high-level graph in which each vertex is populated by a Markov decision process, called a “room”, with several low-level objectives. We proceed as follows. First, we apply deep reinforcement learning (DRL) to obtain low-level policies for each room and objective. Second, we apply reactive synthesis to obtain a planner that selects which low-level policy to apply in each room. Reactive synthesis refers to constructing a planner for a given model of the environment that satisfies a given objective (typically specified as a temporal logic formula) by design. The main advantage of the framework is its formal guarantees. In addition, the framework enables a “separation of concerns”: low-level tasks are addressed using DRL, which enables scaling to large rooms with unknown dynamics, reward engineering is done only locally, and policies can be reused, whereas users can specify high-level tasks intuitively and naturally. The central challenge in synthesis is the need for a model of the rooms. We address this challenge by developing a DRL procedure to train concise “latent” policies together with latent abstract rooms, both paired with PAC guarantees on performance and abstraction quality. Unlike previous approaches, this circumvents a model distillation step. We demonstrate feasibility in a case study involving agent navigation in an environment with moving obstacles.
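
The PAC guarantees mentioned above pair an estimated quantity with a confidence bound. Below is a minimal sketch of one standard way to obtain such a certificate, a Hoeffding bound on an empirical loss; the function name and sampling setup are illustrative assumptions, not the paper's construction.

    import math

    def pac_upper_bound(sample_losses, delta=0.05, loss_range=1.0):
        """Hoeffding-style PAC bound: with probability at least 1 - delta,
        E[loss] <= empirical mean + range * sqrt(ln(1/delta) / (2 n))."""
        n = len(sample_losses)
        mean = sum(sample_losses) / n
        return mean + loss_range * math.sqrt(math.log(1.0 / delta) / (2.0 * n))

    # Usage (hypothetical): certify an abstraction-quality loss from samples.
    # losses = [abstraction_loss(t) for t in sampled_transitions]
    # print(pac_upper_bound(losses, delta=0.05))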

EWRL Workshop 2023 · Workshop Paper

The Wasserstein Believer: Learning Belief Updates for Partially Observable Environments through Reliable Latent Space Models

  • Raphaël Avalos
  • Florent Delgrange
  • Ann Nowe
  • Guillermo Perez
  • Diederik M Roijers

Partially Observable Markov Decision Processes (POMDPs) are useful tools to model environments where the full state cannot be perceived by an agent. As such, the agent needs to reason while taking past observations and actions into account. However, simply remembering the full history is generally intractable due to the exponential growth of the history space. A probability distribution that models the belief over the true state can serve as a sufficient statistic of the history, but computing it requires access to the model of the environment and is itself intractable. State-of-the-art algorithms use Recurrent Neural Networks to compress the observation-action history, aiming to learn a sufficient statistic, but they lack guarantees of success and can lead to sub-optimal policies. To overcome this, we propose the Wasserstein Belief Updater, an RL algorithm that learns a latent model of the POMDP and an approximation of the belief update. Our approach comes with theoretical guarantees on the quality of our approximation, ensuring that the beliefs it outputs allow for learning the optimal value function.
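
For context, the exact Bayesian belief update that the learned updater approximates can be written in a few lines of NumPy. This is a sketch; the tensor layout below is an assumed convention, not the paper's code.

    import numpy as np

    def belief_update(b, a, o, T, Obs):
        """Exact update: b'(s') ∝ Obs[a, s', o] * sum_s T[s, a, s'] * b(s).
        b: belief over states (S,); T: transition tensor (S, A, S);
        Obs: observation probabilities (A, S, O)."""
        predicted = b @ T[:, a, :]               # predict next-state distribution
        unnormalized = predicted * Obs[a, :, o]  # weight by observation likelihood
        return unnormalized / unnormalized.sum()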

EWRL Workshop 2023 · Workshop Paper

Wasserstein Auto-encoded MDPs: Formal Verification of Efficiently Distilled RL Policies with Many-sided Guarantees

  • Florent Delgrange
  • Ann Nowe
  • Guillermo Perez

Although deep reinforcement learning (DRL) has many success stories, the large-scale deployment of policies learned through these advanced techniques in safety-critical scenarios is hindered by their lack of formal guarantees. Variational Markov Decision Processes (VAE-MDPs) are discrete latent space models that provide a reliable framework for distilling formally verifiable controllers from any RL policy. While the related guarantees address relevant practical aspects such as the satisfaction of performance and safety properties, the VAE approach suffers from several learning flaws (posterior collapse, slow learning speed, poor dynamics estimates), primarily due to the absence of abstraction and representation guarantees to support latent optimization. We introduce the Wasserstein auto-encoded MDP (WAE-MDP), a latent space model that fixes those issues by minimizing a penalized form of the optimal transport between the behaviors of the agent executing the original policy and the distilled policy, for which the formal guarantees apply. Our approach yields bisimulation guarantees while learning the distilled policy, allowing concrete optimization of the abstraction and representation model quality. Our experiments show that, besides distilling policies up to 10 times faster, the latent model quality is indeed better in general. Moreover, we present experiments from a simple time-to-failure verification algorithm on the latent space. The fact that our approach enables such simple verification techniques highlights its applicability.
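
A time-to-failure check of the kind mentioned at the end reduces, on an explicit latent model, to a short backward recursion. Here is a sketch under that assumption; the function and argument names are illustrative.

    import numpy as np

    def failure_within(P, pi, fail, horizon):
        """Probability of failure within `horizon` steps under distilled
        policy `pi` on a latent MDP with transition tensor P (S, A, S);
        `fail` is a boolean mask over latent states. Backward recursion:
        v_{k+1}(s) = 1 if s fails, else sum_{s'} P[s, pi[s], s'] * v_k(s')."""
        v = fail.astype(float)
        for _ in range(horizon):
            step = np.array([P[s, pi[s], :] @ v for s in range(len(v))])
            v = np.where(fail, 1.0, step)
        return v  # v[s] = probability of failure within `horizon` from s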

Highlights Conference 2017 · Conference Abstract

Bounded-Regret Determinization of Max-Plus Automata

  • Guillermo Perez

We give tight complexity bounds for the problem of determining whether a given max-plus automaton is bounded-regret determinizable. This generalizes results of Aminof et al. regarding determinization of max-plus automata by pruning. More precisely, we study strategies to resolve the non-determinism of a max-plus automaton on the fly: for every word (given letter by letter), the strategy yields a run of the automaton. The regret of the strategy is then defined as the maximal difference between the value of a constructed run for a word and the value assigned to that word by the automaton. We establish that, given a max-plus automaton and a regret bound, determining whether there is a strategy with regret at most that bound is EXP-complete. Extended abstract.
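
To make the regret definition concrete: the value of a word is the maximum over accepting runs of the summed transition weights, and a strategy's regret is its worst-case gap to that optimum. A small sketch; the data layout is an assumption.

    NEG_INF = float("-inf")

    def mp_value(word, delta, initial, final):
        """Max-plus automaton value of `word`: maximum over accepting runs of
        the sum of weights. delta[(state, letter)] -> list of (state', weight)."""
        best = {q: 0.0 for q in initial}
        for letter in word:
            nxt = {}
            for q, v in best.items():
                for q2, w in delta.get((q, letter), []):
                    nxt[q2] = max(nxt.get(q2, NEG_INF), v + w)
            best = nxt
        return max((v for q, v in best.items() if q in final), default=NEG_INF)

    def strategy_regret(word, run_value, delta, initial, final):
        """Regret on one word: optimal value minus the on-the-fly run's value;
        the strategy's regret is the supremum of this over all words."""
        return mp_value(word, delta, initial, final) - run_value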

IJCAI Conference 2017 · Conference Paper

Reduction Techniques for Model Checking and Learning in MDPs

  • Suda Bharadwaj
  • Stephane Le Roux
  • Guillermo Perez
  • Ufuk Topcu

Omega-regular objectives in Markov decision processes (MDPs) reduce to reachability: find a policy which maximizes the probability of reaching a target set of states. Given an MDP, an initial distribution, and a target set of states, such a policy can be computed by most probabilistic model checking tools. If the MDP is only partially specified, i.e., some probabilities are unknown, then model-learning techniques can be used to statistically approximate the probabilities and enable the computation of the desired policy. For fully specified MDPs, reducing the size of the MDP translates into faster model checking; for partially specified MDPs, into faster learning. We provide reduction techniques that allow us to remove irrelevant transition probabilities: transition probabilities (known, or to be learned) that do not influence the maximal reachability probability. Among other applications, these reductions can be seen as a pre-processing of MDPs before model checking or as a way to reduce the number of experiments required to obtain a good approximation of an unknown MDP.
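
The simplest instance of such a reduction: from any state that cannot reach the target at all, the maximal reachability probability is 0, so its outgoing probabilities (known or to be learned) are irrelevant. Below is a sketch of that pruning step, assuming the MDP's transition support is given as a successor map; the layout is hypothetical.

    def prunable_states(succ, target):
        """States whose outgoing transition probabilities cannot influence the
        maximal reachability probability. succ[s] -> set of states reachable
        from s in one step under some action; target: set of goal states."""
        # Backward reachability: which states can reach the target at all?
        pred = {}
        for s, successors in succ.items():
            for t in successors:
                pred.setdefault(t, set()).add(s)
        can_reach, frontier = set(target), list(target)
        while frontier:
            t = frontier.pop()
            for s in pred.get(t, ()):
                if s not in can_reach:
                    can_reach.add(s)
                    frontier.append(s)
        # Everything else has maximal reachability probability 0.
        return set(succ) - can_reach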

Highlights Conference 2015 · Conference Abstract

Minimizing Regret in Quantitative Games

  • Guillermo Perez

Two-player zero-sum games of infinite duration and their quantitative versions are used in verification to model the interaction between a controller (Eve) and its environment (Adam). The question usually addressed is that of the existence (and computability) of a strategy for Eve that can maximize her payoff against any strategy of Adam. In this work, we are interested in strategies of Eve that minimize her regret, i.e., strategies that minimize the difference between her actual payoff and the payoff she could have achieved if she had known the strategy of Adam in advance. We give algorithms to compute the strategies of Eve that ensure minimal regret against an adversary whose choice of strategy is (i) unrestricted, (ii) limited to positional strategies, or (iii) limited to word strategies. We also establish relations between the latter version and other problems studied in the literature.
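
A toy one-shot analogue makes the definition concrete (an illustration only; the paper treats games of infinite duration): Eve's regret is the worst case, over Adam's choices, of the gap between her payoff and the best payoff she could have secured had she known Adam's choice.

    def one_shot_regret(payoff, eve_row):
        """payoff[i][j]: Eve's payoff when Eve plays row i and Adam column j.
        Regret of committing to `eve_row`: worst case over Adam's columns of
        (best row in hindsight) - (payoff actually obtained)."""
        rows, cols = range(len(payoff)), range(len(payoff[0]))
        return max(max(payoff[i][j] for i in rows) - payoff[eve_row][j]
                   for j in cols)

    # Example: with payoffs [[1, -1], [-1, 1]], committing to row 0 has regret 2.
    print(one_shot_regret([[1, -1], [-1, 1]], eve_row=0))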