Arrow Research

Author name cluster

Olivier Buffet

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

38 papers
2 author rows

Possible papers (38)

NeurIPS Conference 2025 Conference Paper

ε-Optimally Solving Two-Player Zero-Sum POSGs

  • Erwan Escudie
  • Matthia Sabatelli
  • Olivier Buffet
  • Jilles Dibangoye

We present a novel framework for ε-optimally solving two-player zero-sum partially observable stochastic games (zs-POSGs). These games pose a major challenge due to the absence of a principled connection with dynamic programming (DP) techniques developed for two-player zero-sum stochastic games (zs-SGs). Prior attempts at transferring solution methods have lacked a lossless reduction—defined here as a transformation that preserves value functions, equilibrium strategies, and optimality structure—thereby limiting generalisation to ad hoc algorithms. This work introduces the first lossless reduction from zs-POSGs to transition-independent zs-SGs, enabling the principled application of a broad class of DP-based methods. We show empirically that point-based value iteration (PBVI) algorithms, applied via this reduction, produce ε-optimal strategies across a range of benchmark domains, consistently matching or outperforming existing state-of-the-art methods. Our results open a systematic pathway for algorithmic and theoretical transfer from SGs to partially observable settings.

AAMAS Conference 2025 Conference Paper

Observer-Aware Probabilistic Planning under Partial Observability

  • Salomé Lepers
  • Vincent Thomas
  • Olivier Buffet

We are interested in planning problems where the agent is aware of the presence of an observer, and where this observer is in a partial observability situation. The agent has to choose its strategy so as to optimize the information transmitted by observations. Building on observer-aware Markov decision processes (OAMDPs), we propose a framework to handle this type of problem and thus formalize properties such as legibility, explicability and predictability. This extension of OAMDPs to partial observability not only handles more realistic problems, but also permits considering dynamic hidden variables of interest. We discuss theoretical properties of PO-OAMDPs and, experimenting with benchmark problems, we analyze HSVI's convergence behavior with dedicated initializations and study the resulting strategies.

ECAI Conference 2025 Conference Paper

Post-Hoc Interpretation of POMDP Policies

  • Geoffrey Laforest
  • Olivier Buffet
  • Alexandre Niveau
  • Bruno Zanuttini

Policies for partially observable Markov decision processes are rich objects, prescribing actions to take depending on the whole history of observations and actions. Typical representations of such policies are by hyperplanes in the space of belief states, or by finite-state controllers, which are arguably not easy to interpret. We propose to redescribe policies into mappings defined on features of the current belief state, built in a systematic manner from state features. Such a mapping can in turn be represented by an intelligible object, like a decision tree, thereby providing an interpretable representation of the policy as a whole. We moreover show how our approach allows us to explain the decision taken by an agent at each step of an interaction with the environment. This provides an end-to-end process, starting from a policy computed by any solver, and ending with an explanation of each decision made at execution time. We formally define our approach, investigate related computational problems, and report on experiments on several families of problems.
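
The redescription step lends itself to a compact illustration. Below is a minimal sketch, assuming belief-level features are expectations of state features and using scikit-learn's decision tree as a stand-in for whichever learner is actually employed; `policy_action` and `state_features` are hypothetical hooks, not names from the paper.

```python
# Hedged sketch of policy redescription: label sampled beliefs with the
# action the solved policy prescribes, then fit an interpretable tree
# over belief features. Not the paper's implementation.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

def belief_features(belief, state_features):
    """Belief-level features as expectations of state features."""
    return belief @ state_features       # [n_states] @ [n_states, n_feats]

def redescribe(beliefs, state_features, policy_action, max_depth=3):
    X = np.stack([belief_features(b, state_features) for b in beliefs])
    y = np.array([policy_action(b) for b in beliefs])
    tree = DecisionTreeClassifier(max_depth=max_depth).fit(X, y)
    return tree  # export_text(tree) then yields a readable rule list
```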

UAI Conference 2024 Conference Paper

Approximation Algorithms for Observer Aware MDPs

  • Shuwa Miura
  • Olivier Buffet
  • Shlomo Zilberstein

We present approximation algorithms for Observer-Aware Markov Decision Processes (OAMDPs). OAMDPs model sequential decision-making problems in which rewards depend on the beliefs of an observer about the goals, intentions, or capabilities of the observed agent. The first proposed algorithm is a grid-based value iteration (Grid-VI), which discretizes the observer's belief into regular grids. Based on the same discretization, the second proposed algorithm is a variant of Real-Time Dynamic Programming (RTDP) called Grid-RTDP. Unlike Grid-VI, Grid-RTDP focuses its updates on promising states using heuristic estimates. We provide theoretical guarantees for the proposed algorithms and demonstrate that Grid-RTDP has good anytime performance, comparable to that of an existing approach that lacks performance guarantees.
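
The grid idea is easy to make concrete. The sketch below, under the simplifying assumption of deterministic belief dynamics (`belief_step` and `reward` are hypothetical hooks, not the paper's API), discretizes a belief simplex into a regular grid and runs value iteration by snapping each updated belief to its nearest grid point.

```python
# Minimal sketch (not the paper's implementation) of grid-based value
# iteration over a belief simplex.
import itertools
import numpy as np

def regular_grid(n_states: int, resolution: int) -> np.ndarray:
    """All beliefs whose coordinates are multiples of 1/resolution."""
    pts = [np.array(c, dtype=float) / resolution
           for c in itertools.product(range(resolution + 1), repeat=n_states)
           if sum(c) == resolution]
    return np.stack(pts)

def snap(belief: np.ndarray, grid: np.ndarray) -> int:
    """Index of the grid point closest to `belief` in L1 distance."""
    return int(np.argmin(np.abs(grid - belief).sum(axis=1)))

def grid_value_iteration(grid, actions, belief_step, reward,
                         gamma=0.95, iters=200):
    V = np.zeros(len(grid))
    for _ in range(iters):
        V = np.array([max(reward(b, a)
                          + gamma * V[snap(belief_step(b, a), grid)]
                          for a in actions)
                      for b in grid])
    return V
```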

ICML Conference 2024 Conference Paper

Solving Hierarchical Information-Sharing Dec-POMDPs: An Extensive-Form Game Approach

  • Johan Peralez
  • Aurélien Delage
  • Olivier Buffet
  • Jilles Dibangoye

A recent theory shows that a multi-player decentralized partially observable Markov decision process can be transformed into an equivalent single-player game, enabling the application of Bellman’s principle of optimality to solve the single-player game by breaking it down into single-stage subgames. However, this approach entangles the decision variables of all players at each single-stage subgame, resulting in backups with a double-exponential complexity. This paper demonstrates how to disentangle these decision variables while maintaining optimality under hierarchical information sharing, a prominent management style in our society. To achieve this, we apply the principle of optimality to solve any single-stage subgame by breaking it down further into smaller subgames, enabling us to make single-player decisions at a time. Our approach reveals that extensive-form games always exist with solutions to a single-stage subgame, significantly reducing time complexity. Our experimental results show that the algorithms leveraging these findings can scale up to much larger multi-player games without compromising optimality.

UAI Conference 2023 Conference Paper

Monte-Carlo Search for an Equilibrium in Dec-POMDPs

  • Yang You 0003
  • Vincent Thomas
  • Francis Colas
  • Olivier Buffet

Decentralized partially observable Markov decision processes (Dec-POMDPs) formalize the problem of designing individual controllers for a group of collaborative agents under stochastic dynamics and partial observability. Seeking a global optimum is difficult (NEXP-complete), but seeking a Nash equilibrium (each agent policy being a best response to the other agents) is more accessible, and has allowed addressing infinite-horizon problems with solutions in the form of finite-state controllers (FSCs). In this paper, we show that this approach can be adapted to cases where only a generative model (a simulator) of the Dec-POMDP is available. This requires relying on a simulation-based POMDP solver to construct an agent's FSC node by node. A related process is used to heuristically derive initial FSCs. Experiments on benchmarks show that MC-JESP is competitive with existing Dec-POMDP solvers, and even outperforms many offline methods that use explicit models.

ICRA Conference 2023 Conference Paper

Robust Robot Planning for Human-Robot Collaboration

  • Yang You 0003
  • Vincent Thomas
  • Francis Colas
  • Rachid Alami 0001
  • Olivier Buffet

In human-robot collaboration, the objectives of the human are often unknown to the robot. Moreover, even assuming a known objective, the human behavior is also uncertain. In order to plan a robust robot behavior, a key preliminary question is then: how do we derive realistic human behaviors given a known objective? A major issue is that such a human behavior should itself account for the robot behavior, otherwise collaboration cannot happen. In this paper, we rely on Markov decision models, representing the uncertainty over the human objective as a probability distribution over a finite set of objective functions (inducing a distribution over human behaviors). Based on this, we propose two contributions: 1) an approach to automatically generate an uncertain human behavior (a policy) for each given objective function while accounting for possible robot behaviors; and 2) a robot planning algorithm that is robust to the above-mentioned uncertainties and relies on solving a partially observable Markov decision process (POMDP) obtained by reasoning on a distribution over human behaviors. A co-working scenario allows us to conduct experiments and present qualitative and quantitative results evaluating our approach.

AAAI Conference 2021 Conference Paper

K-N-MOMDPs: Towards Interpretable Solutions for Adaptive Management

  • Jonathan Ferrer-Mestres
  • Thomas G. Dietterich
  • Olivier Buffet
  • Iadine Chades

In biodiversity conservation, adaptive management (AM) is the principal tool for decision making under uncertainty. AM problems are planning problems that can be modelled using Mixed Observability MDPs (MOMDPs). MOMDPs tackle decision problems where state variables are completely or partially observable. Unfortunately, MOMDP solutions (policy graphs) are too complex to be interpreted by human decision-makers. Here, we provide algorithms to solve K-N-MOMDPs, where K represents the maximum number of fully observable states and N represents the maximum number of α-vectors. Our algorithms calculate compact and more interpretable policy graphs from existing MOMDP models and solutions. We apply these algorithms to two computational sustainability applications: optimal release of biocontrol agents to prevent dengue epidemics and conservation of the threatened bird species Gouldian finch. The methods dramatically reduce the number of states and α-vectors in MOMDP problems without significantly reducing their quality. The resulting policies have small policy graphs (4-6 nodes) that can be easily interpreted by human decision-makers.

ECAI Conference 2020 Conference Paper

Monte Carlo Information-Oriented Planning

  • Vincent Thomas
  • Gérémy Hutin
  • Olivier Buffet

In this article, we discuss how to solve information-gathering problems expressed as ρ-POMDPs, an extension of Partially Observable Markov Decision Processes (POMDPs) whose reward ρ depends on the belief state. Point-based approaches used for solving POMDPs have been extended to solving ρ-POMDPs as belief MDPs when the reward ρ is convex in the belief space or when it is Lipschitz-continuous. In the present paper, we build on the POMCP algorithm to propose a Monte Carlo Tree Search for ρ-POMDPs, aiming for an efficient on-line planner which can be used for any ρ function. Adaptations are required due to the belief-dependent rewards to (i) propagate more than one state at a time, and (ii) prevent biases in value estimates. An asymptotic convergence proof to ε-optimal values is given when ρ is continuous. Experiments are conducted to analyze the algorithms at hand and show that they outperform myopic approaches.
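
The key adaptation, propagating several states at a time so the belief-dependent reward can be estimated inside the search, can be sketched as follows. Negative entropy stands in for ρ and `step(s, a) -> (s', o)` is an assumed generative model; this is an illustration, not the authors' planner.

```python
# Sketch of the "propagate several states at a time" adaptation: a random
# depth-limited rollout carrying a particle set so rho (negative entropy
# here) can be estimated at every step. The filtering is a crude particle
# filter, not the paper's exact scheme.
import math
import random
from collections import Counter

def neg_entropy(particles):
    """rho(b) estimated from a particle approximation of the belief b."""
    counts = Counter(particles)
    n = len(particles)
    return sum((c / n) * math.log(c / n) for c in counts.values())

def rollout(particles, step, actions, depth, gamma=0.95):
    if depth == 0:
        return 0.0
    a = random.choice(actions)
    successors = [step(s, a) for s in particles]
    o = random.choice([obs for _, obs in successors])  # sample an observation
    nxt = ([s2 for s2, o2 in successors if o2 == o]
           or [s2 for s2, _ in successors])            # avoid particle death
    return neg_entropy(nxt) + gamma * rollout(nxt, step, actions,
                                              depth - 1, gamma)
```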

ICML Conference 2020 Conference Paper

Optimally Solving Two-Agent Decentralized POMDPs Under One-Sided Information Sharing

  • Yuxuan Xie
  • Jilles Dibangoye
  • Olivier Buffet

Optimally solving decentralized partially observable Markov decision processes under either full or no information sharing received significant attention in recent years. However, little is known about how partial information sharing affects existing theory and algorithms. This paper addresses this question for a team of two agents, with one-sided information sharing—i.e., both agents have imperfect information about the state of the world, but only one has access to what the other sees and does. From the perspective of a central planner, we show that the original problem can be reformulated into an equivalent information-state Markov decision process and solved as such. In addition, we prove that the optimal value function exhibits a specific form of uniform continuity. We also present a heuristic search algorithm utilizing this property and provide the first results for this family of problems.

ICAPS Conference 2020 Conference Paper

Solving K-MDPs

  • Jonathan Ferrer-Mestres
  • Thomas G. Dietterich
  • Olivier Buffet
  • Iadine Chades

Markov Decision Processes (MDPs) are employed to model sequential decision-making problems under uncertainty. Traditionally, algorithms to solve MDPs have focused on solving large state or action spaces. With increasing applications of MDPs to human-operated domains such as conservation of biodiversity and health, developing easy-to-interpret solutions is of paramount importance to increase uptake of MDP policies. Here, we define the problem of solving K-MDPs, i.e., given an original MDP and a constraint on the number of states (K), generate a reduced state space MDP that minimizes the difference between the original optimal MDP value function and the reduced optimal K-MDP value function. Building on existing non-transitive and transitive approximate state abstraction functions, we propose a family of three algorithms based on binary search with sub-optimality bounded polynomially in a precision parameter: ϕQ*ε K-MDP-ILP, ϕQ*d K-MDP and ϕa*d K-MDP. We compare these algorithms to a greedy algorithm (ϕQ*ε Greedy K-MDP) and a clustering approach (k-means++ K-MDP). On randomly generated MDPs and two computational sustainability MDPs, ϕa*d K-MDP outperformed all algorithms when it could find a feasible solution. While numerous state abstraction problems have been proposed in the literature, this is the first time that the general problem of solving K-MDPs has been suggested. We hope that our work will generate future research aiming at increasing the interpretability of MDP policies in human-operated domains.
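
As a rough illustration of the Q*-based abstraction idea (closest in spirit to the greedy baseline, not the ILP or binary-search algorithms), one can cluster states whose optimal Q-vectors agree within a tolerance, coarsening until the K budget is met:

```python
# Hedged sketch: cluster states by similarity of their optimal Q-vectors,
# doubling the tolerance until at most K abstract states remain. Q is an
# assumed precomputed array of shape [n_states, n_actions].
import numpy as np

def greedy_k_abstraction(Q: np.ndarray, K: int, eps0: float = 1e-3):
    eps = eps0
    while True:
        clusters = []           # list of (representative Q-vector, [states])
        for s, q in enumerate(Q):
            for rep, members in clusters:
                if np.max(np.abs(q - rep)) <= eps:
                    members.append(s)
                    break
            else:
                clusters.append((q, [s]))
        if len(clusters) <= K:
            return [members for _, members in clusters], eps
        eps *= 2                # coarsen until the K budget is met
```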

ICML Conference 2018 Conference Paper

Learning to Act in Decentralized Partially Observable MDPs

  • Jilles Dibangoye
  • Olivier Buffet

We address a long-standing open problem of reinforcement learning in decentralized partially observable Markov decision processes. Previous attempts focussed on different forms of generalized policy iteration, which at best led to local optima. In this paper, we restrict attention to plans, which are simpler to store and update than policies. We derive, under certain conditions, the first near-optimal cooperative multi-agent reinforcement learning algorithm. To achieve significant scalability gains, we replace the greedy maximization by mixed-integer linear programming. Experiments show our approach can learn to act near-optimally in many finite domains from the literature.

NeurIPS Conference 2018 Conference Paper

ρ-POMDPs have Lipschitz-Continuous ε-Optimal Value Functions

  • Mathieu Fehr
  • Olivier Buffet
  • Vincent Thomas
  • Jilles Dibangoye

Many state-of-the-art algorithms for solving Partially Observable Markov Decision Processes (POMDPs) rely on turning the problem into a “fully observable” problem—a belief MDP—and exploiting the piece-wise linearity and convexity (PWLC) of the optimal value function in this new state space (the belief simplex ∆). This approach has been extended to solving ρ-POMDPs—i.e., for information-oriented criteria—when the reward ρ is convex in ∆. General ρ-POMDPs can also be turned into “fully observable” problems, but with no means to exploit the PWLC property. In this paper, we focus on POMDPs and ρ-POMDPs with λρ-Lipschitz reward function, and demonstrate that, for finite horizons, the optimal value function is Lipschitz-continuous. Then, value function approximators are proposed for both upper- and lower-bounding the optimal value function, which are shown to provide uniformly improvable bounds. This allows proposing two algorithms derived from HSVI which are empirically evaluated on various benchmark problems.
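
The central property can be restated compactly (notation mine, not copied verbatim from the paper): a Lipschitz reward induces a Lipschitz optimal value function, which in turn yields cone-shaped upper and lower bounds that an HSVI-style solver can improve uniformly.

```latex
% Hedged restatement; \lambda_t denotes a horizon-dependent constant.
\[
  |\rho(b,a) - \rho(b',a)| \le \lambda_\rho \,\|b - b'\|_1
  \;\Longrightarrow\;
  |V^*_t(b) - V^*_t(b')| \le \lambda_t \,\|b - b'\|_1 ,
\]
\[
  \underline{V}_t(b) = \max_i \big( v_i - \lambda_t \|b - b_i\|_1 \big),
  \qquad
  \overline{V}_t(b) = \min_j \big( w_j + \lambda_t \|b - b_j\|_1 \big).
\]
```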

JAIR Journal 2016 Journal Article

Goal Probability Analysis in Probabilistic Planning: Exploring and Enhancing the State of the Art

  • Marcel Steinmetz
  • Jörg Hoffmann
  • Olivier Buffet

Unavoidable dead-ends are common in many probabilistic planning problems, e.g., when actions may fail or when operating under resource constraints. An important objective in such settings is MaxProb, determining the maximal probability with which the goal can be reached, and a policy achieving that probability. Yet algorithms for MaxProb probabilistic planning are severely underexplored, to the extent that there is scant evidence of what the empirical state of the art actually is. We close this gap with a comprehensive empirical analysis. We design and explore a large space of heuristic search algorithms, systematizing known algorithms and contributing several new algorithm variants. We consider MaxProb, as well as weaker objectives that we baptize AtLeastProb (requiring to achieve a given goal probability threshold) and ApproxProb (requiring to compute the maximum goal probability up to a given accuracy). We explore both the general case where there may be 0-reward cycles, and the practically relevant special case of acyclic planning, such as planning with a limited action-cost budget. We design suitable termination criteria, search algorithm variants, dead-end pruning methods using classical planning heuristics, and node selection strategies. We design a benchmark suite comprising more than 1000 instances adapted from the IPPC, resource-constrained planning, and simulated penetration testing. Our evaluation clarifies the state of the art, characterizes the behavior of a wide range of heuristic search algorithms, and demonstrates significant benefits of our new algorithm variants.
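
To make the MaxProb objective concrete: on an explicit MDP it is the least fixed point of a Bellman operator over goal-reachability probabilities. The paper's contribution is heuristic search, not plain value iteration; the sketch below (with `P[s][a]` an assumed dict from next states to probabilities) only pins down the objective.

```python
# Minimal sketch of the MaxProb objective: iterate
# V(s) <- max_a sum_s' P(s'|s,a) V(s'), with V fixed to 1 at goals.
# Starting from 0, this converges from below to the maximal goal
# probability (the least fixed point).
def maxprob_value_iteration(states, goals, P, tol=1e-8):
    V = {s: (1.0 if s in goals else 0.0) for s in states}
    while True:
        delta = 0.0
        for s in states:
            if s in goals or not P.get(s):
                continue  # goals are absorbing; dead-ends keep V = 0
            new = max(sum(p * V[s2] for s2, p in dist.items())
                      for dist in P[s].values())
            delta = max(delta, abs(new - V[s]))
            V[s] = new
        if delta < tol:
            return V
```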

JAIR Journal 2016 Journal Article

Optimally Solving Dec-POMDPs as Continuous-State MDPs

  • Jilles Steeve Dibangoye
  • Christopher Amato
  • Olivier Buffet
  • François Charpillet

Decentralized partially observable Markov decision processes (Dec-POMDPs) provide a general model for decision-making under uncertainty in decentralized settings, but are difficult to solve optimally (NEXP-Complete). As a new way of solving these problems, we introduce the idea of transforming a Dec-POMDP into a continuous-state deterministic MDP with a piecewise-linear and convex value function. This approach makes use of the fact that planning can be accomplished in a centralized offline manner, while execution can still be decentralized. This new Dec-POMDP formulation, which we call an occupancy MDP, allows powerful POMDP and continuous-state MDP methods to be used for the first time. To provide scalability, we refine this approach by combining heuristic search and compact representations that exploit the structure present in multi-agent domains, without losing the ability to converge to an optimal solution. In particular, we introduce a feature-based heuristic search value iteration (FB-HSVI) algorithm that relies on feature-based compact representations, point-based updates and efficient action selection. A theoretical analysis demonstrates that FB-HSVI terminates in finite time with an optimal solution. We include an extensive empirical analysis using well-known benchmarks, thereby demonstrating that our approach provides significant scalability improvements compared to the state of the art.
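
The occupancy state at the heart of this reformulation admits a short sketch: a distribution over (hidden state, joint history) pairs, updated deterministically by a joint decision rule. `T` and `O` are assumed explicit model tables; this is not the FB-HSVI implementation.

```python
# Hedged sketch of the occupancy-state recursion behind the occupancy MDP.
# eta: {(state, joint_hist): prob}; d(joint_hist) -> joint action;
# T[s][a] -> {s': p}; O[a][s'] -> {joint_obs: p}.
# An initial occupancy might be eta0 = {(s0, ()): 1.0}.
from collections import defaultdict

def next_occupancy(eta, d, T, O):
    nxt = defaultdict(float)
    for (s, hist), p in eta.items():
        a = d(hist)
        for s2, pt in T[s][a].items():
            for z, po in O[a][s2].items():
                nxt[(s2, hist + ((a, z),))] += p * pt * po
    return dict(nxt)
```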

ICAPS Conference 2016 Conference Paper

Revisiting Goal Probability Analysis in Probabilistic Planning

  • Marcel Steinmetz
  • Jörg Hoffmann 0001
  • Olivier Buffet

Maximizing goal probability is an important objective in probabilistic planning, yet algorithms for its optimal solution are severely underexplored. There is scant evidence of what the empirical state of the art actually is. Focusing on heuristic search, we close this gap with a comprehensive empirical analysis of known and adapted algorithms. We explore both the general case where there may be 0-reward cycles, and the practically relevant special case of acyclic planning, like planning with a limited action-cost budget. We consider three different algorithmic objectives. We design suitable termination criteria, search algorithm variants, dead-end pruning methods using classical planning heuristics, and node selection strategies. Our evaluation on more than 1000 benchmark instances from the IPPC, resource-constrained planning, and simulated penetration testing reveals the behavior of heuristic search, and exhibits several improvements to the state of the art.

IJCAI Conference 2015 Conference Paper

Exploiting Separability in Multiagent Planning with Continuous-State MDPs (Extended Abstract)

  • Jilles Steeve Dibangoye
  • Christopher Amato
  • Olivier Buffet
  • François Charpillet

Decentralized partially observable Markov decision processes (Dec-POMDPs) provide a general model for decision-making under uncertainty in cooperative decentralized settings, but are difficult to solve optimally (NEXP-Complete). As a new way of solving these problems, we recently introduced a method for transforming a Dec-POMDP into a continuous-state deterministic MDP with a piecewise-linear and convex value function. This new Dec-POMDP formulation, which we call an occupancy MDP, allows powerful POMDP and continuous-state MDP methods to be used for the first time. However, scalability remains limited when the number of agents or problem variables becomes large. In this paper, we show that, under certain separability conditions of the optimal value function, the scalability of this approach can increase considerably. This separability is present when there is locality of interaction between agents, which can be exploited to improve performance. Unlike most previous methods, the novel continuous-state MDP algorithm retains optimality and convergence guarantees. Results show that the extension using separability can scale to a large number of agents and domain variables while maintaining optimality.

IJCAI Conference 2015 Conference Paper

Structural Results for Cooperative Decentralized Control Models

  • Jilles Steeve Dibangoye
  • Olivier Buffet
  • Olivier Simonin

The intractability in cooperative, decentralized control models is mainly due to prohibitive memory requirements in both optimal policies and value functions. Complexity analysis has emerged as the standard method for estimating the memory needed to solve a given computational problem, but complexity results may be somewhat limited. This paper introduces a general methodology, structural analysis, for the design of optimality-preserving concise policies and value functions, which will eventually lead to the development of efficient theory and algorithms. For the first time, we show that memory requirements for policies and value functions may be asymmetric, resulting in cooperative, decentralized control models with exponential reductions in memory requirements.

ECAI Conference 2014 Conference Paper

Learning Pruning Rules for Heuristic Search Planning

  • Michal Krajnanský
  • Jörg Hoffmann 0001
  • Olivier Buffet
  • Alan Fern

When it comes to learning control knowledge for planning, most works focus on "how to do it" knowledge which is then used to make decisions regarding which actions should be applied in which state. We pursue the opposite approach of learning "how to not do it" knowledge, used to make decisions regarding which actions should not be applied in which state. Our intuition is that "bad actions" are often easier to characterize than "good" ones. An obvious application, which has not been considered by the few prior works on learning bad actions, is to use such learned knowledge as action pruning rules in heuristic search planning. Fixing a canonical rule language and an off-the-shelf learning tool, we explore a novel method for generating training data, and implement rule evaluators in state-of-the-art planners. The experiments show that the learned rules can yield dramatic savings, even when the native pruning rules of these planners, i.e., preferred operators, are already switched on.

ECAI Conference 2014 Conference Paper

Simultaneous Tracking and Activity Recognition (STAR) using Advanced Agent-Based Behavioral Simulations

  • Arsène Fansi Tchango
  • Vincent Thomas
  • Olivier Buffet
  • Fabien Flacher
  • Alain Dutech

Tracking and understanding moving pedestrian behaviors is of major concern for a growing number of applications. Classical approaches either consider both problems separately or treat them simultaneously on the basis of limited contextual graphical models. In this paper, we consider tackling both problems jointly based on richer contextual information derived from agent-based behavioral simulators designed to realistically reproduce human behaviors within complex environments. We focus on the single-target case and experimentally show that the proposed approach maintains good performance even during long periods of occlusion.

ECAI Conference 2014 Conference Paper

Stop-Free Strategies for Traffic Networks: Decentralized On-line Optimization

  • Mohamed Tlig
  • Olivier Buffet
  • Olivier Simonin 0001

Traffic management in large networks remains an important challenge in transportation systems. The best approach would be to use existing infrastructure and find a solution to manage the increasing flows of vehicles. Multi-agent systems and autonomous vehicles are today considered a promising approach to deal with traffic control. In this paper, we propose a two-level decentralized multi-agent system that allows autonomous vehicles to cross network intersections without stopping. At the first level, we use a control agent at each intersection which (1) lets the vehicles from each road pass alternately, and (2) allows them to optimally regulate their speed in its vicinity. At the second level, each agent coordinates with its neighboring agents in order to optimize the flows inside the network. We evaluate this approach empirically, comparing it with a more opportunistic First-Come First-Served strategy. Experimental results in simulation, including energy-consumption measurements, show the advantages and disadvantages of each approach.

IJCAI Conference 2013 Conference Paper

Adaptive Management of Migratory Birds under Sea Level Rise

  • Samuel Nicol
  • Olivier Buffet
  • Takuya Iwamura
  • Iadine Chadès

The best practice method for managing ecological systems under uncertainty is adaptive management (AM), an iterative process of reducing uncertainty while simultaneously optimizing a management objective. Existing solution methods used for AM problems assume that the system dynamics are stationary, i.e., described by one of a set of pre-defined models. In reality ecological systems are rarely stationary and evolve over time. Importantly, the effects of climate change on populations are unlikely to be captured by stationary models. Practitioners need efficient algorithms to implement AM on real-world problems. AM can be formulated as a hidden model Markov Decision Process (hmMDP), which allows the state space to be factored and shows promise for the rapid resolution of large problems. We provide an ecological dataset and performance metrics for the AM of a network of shorebird species utilizing the East Asian-Australasian flyway given uncertainty about the rate of sea level rise. The non-stationary system is modelled as a stationary POMDP containing hidden alternative models with known probabilities of transition between them. We challenge the POMDP community to exploit the simplifications allowed by structuring the AM problem as an hmMDP and improve our benchmark solutions.

IJCAI Conference 2013 Conference Paper

Optimally Solving Dec-POMDPs as Continuous-State MDPs

  • Jilles Steeve Dibangoye
  • Christopher Amato
  • Olivier Buffet
  • François Charpillet

Optimally solving decentralized partially observable Markov decision processes (Dec-POMDPs) is a hard combinatorial problem. Current algorithms search through the space of full histories for each agent. Because of the doubly exponential growth in the number of policies in this space as the planning horizon increases, these methods quickly become intractable. However, in real world problems, computing policies over the full history space is often unnecessary. True histories experienced by the agents often lie near a structured, low-dimensional manifold embedded into the history space. We show that by transforming a Dec-POMDP into a continuous-state MDP, we are able to find and exploit these low-dimensional representations. Using this novel transformation, we can then apply powerful techniques for solving POMDPs and continuous-state MDPs. By combining a general search algorithm and dimension reduction based on feature selection, we introduce a novel approach to optimally solve problems with significantly longer planning horizons than previous methods.

AAAI Conference 2012 Conference Paper

MOMDPs: A Solution for Modelling Adaptive Management Problems

  • Iadine Chades
  • Josie Carwardine
  • Tara Martin
  • Samuel Nicol
  • Regis Sabbadin
  • Olivier Buffet

In conservation biology and natural resource management, adaptive management is an iterative process of improving management by reducing uncertainty via monitoring. Adaptive management is the principal tool for conserving endangered species under global change, yet adaptive management problems suffer from a poor suite of solution methods. The common approach used to solve an adaptive management problem is to assume the system state is known and the system dynamics can be one of a set of pre-defined models. The solution method used is unsatisfactory, employing value iteration on a discretized belief MDP which restricts the study to very small problems. We show how to overcome this limitation by modelling an adaptive management problem as a restricted Mixed Observability MDP called hidden model MDP (hmMDP). We demonstrate how to simplify the value function, the backup operator and the belief update computation. We show that, although a simplified case of POMDPs, hmMDPs are PSPACE-complete in the finite-horizon case. We illustrate the use of this model to manage a population of the threatened Gouldian finch, a bird species endemic to Northern Australia. Our simple modelling approach is an important step towards efficient algorithms for solving adaptive management problems.
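
The simplification hmMDPs afford can be sketched in a few lines: since the system state is observed, the belief collapses to a distribution over the finite model set, updated by how well each candidate model predicts the observed transition. Names below are illustrative, not from the paper.

```python
# Hedged sketch of the hidden-model belief update in an hmMDP.
# belief: {m: p}; models[m][s][a] -> {s2: p} (assumed explicit tables).
def model_belief_update(belief, s, a, s2, models):
    post = {m: p * models[m][s][a].get(s2, 0.0) for m, p in belief.items()}
    z = sum(post.values())
    return {m: p / z for m, p in post.items()} if z > 0 else belief
```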

AAAI Conference 2012 Conference Paper

POMDPs Make Better Hackers: Accounting for Uncertainty in Penetration Testing

  • Carlos Sarraute
  • Olivier Buffet
  • Jörg Hoffmann

Penetration Testing is a methodology for assessing network security, by generating and executing possible hacking attacks. Doing so automatically allows for regular and systematic testing. A key question is how to generate the attacks. This is naturally formulated as planning under uncertainty, i.e., under incomplete knowledge about the network configuration. Previous work uses classical planning, and requires costly pre-processes reducing this uncertainty by extensive application of scanning methods. By contrast, we herein model the attack planning problem in terms of partially observable Markov decision processes (POMDPs). This allows us to reason about the knowledge available, and to intelligently employ scanning actions as part of the attack. As one would expect, this accurate solution does not scale. We devise a method that relies on POMDPs to find good attacks on individual machines, which are then composed into an attack on the network as a whole. This decomposition exploits network structure to the extent possible, making targeted approximations (only) where needed. Evaluating this method on a suitably adapted industrial test suite, we demonstrate its effectiveness in both runtime and solution quality.

EWRL Workshop 2011 Conference Paper

Active Learning of MDP Models

  • Mauricio Araya-López
  • Olivier Buffet
  • Vincent Thomas
  • François Charpillet

We consider the active learning problem of inferring the transition model of a Markov Decision Process by acting and observing transitions. This is particularly useful when no reward function is a priori defined. Our proposal is to cast the active learning task as a utility maximization problem using Bayesian reinforcement learning with belief-dependent rewards. After presenting three possible performance criteria, we derive from them the belief-dependent rewards to be used in the decision-making process. As computing the optimal Bayesian value function is intractable for large horizons, we use a simple algorithm to approximately solve this optimization problem. Despite the sub-optimality of this technique, we show experimentally that our proposal is efficient in a number of domains.
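
A minimal sketch of the Bayesian model-learning substrate, assuming Dirichlet counts per (state, action) pair; the entropy score below is a hypothetical stand-in for the paper's three performance criteria:

```python
# Hedged sketch: Dirichlet posterior over an MDP's transition model,
# with an illustrative uncertainty score per (state, action).
import math
from collections import defaultdict

class DirichletModel:
    def __init__(self, prior=1.0):
        self.counts = defaultdict(lambda: defaultdict(lambda: prior))

    def update(self, s, a, s2):
        """Record one observed transition (s, a) -> s2."""
        self.counts[(s, a)][s2] += 1.0

    def mean(self, s, a):
        """Posterior-mean transition distribution for (s, a)."""
        c = self.counts[(s, a)]
        tot = sum(c.values())
        return {s2: v / tot for s2, v in c.items()}

    def entropy(self, s, a):
        """Entropy of the posterior mean: one possible uncertainty score."""
        return -sum(p * math.log(p)
                    for p in self.mean(s, a).values() if p > 0)
```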

NeurIPS Conference 2010 Conference Paper

A POMDP Extension with Belief-dependent Rewards

  • Mauricio Araya
  • Olivier Buffet
  • Vincent Thomas
  • François Charpillet

Partially Observable Markov Decision Processes (POMDPs) model sequential decision-making problems under uncertainty and partial observability. Unfortunately, some problems cannot be modeled with state-dependent reward functions, e.g., problems whose objective explicitly implies reducing the uncertainty on the state. To that end, we introduce ρ-POMDPs, an extension of POMDPs where the reward function ρ depends on the belief state. We show that, under the common assumption that ρ is convex, the value function is also convex, which makes it possible to (1) approximate ρ arbitrarily well with a piecewise linear and convex (PWLC) function, and (2) use state-of-the-art exact or approximate solving algorithms with limited changes.
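
For intuition, here is a hedged sketch of the PWLC machinery with negative entropy as a typical convex, uncertainty-reducing ρ (an example choice, not necessarily the paper's): tangent hyperplanes at sampled beliefs play the role of α-vectors and lower-bound ρ over the simplex.

```python
# Hedged sketch: PWLC lower approximation of a convex belief-dependent
# reward by a max over tangent hyperplanes.
import numpy as np

def neg_entropy(b):
    b = np.clip(b, 1e-12, 1.0)
    return float(np.sum(b * np.log(b)))

def tangent(b0):
    """On the simplex, alpha = log(b0) satisfies <alpha, b> <= rho(b),
    with equality at b0 (first-order expansion of sum b log b)."""
    return np.log(np.clip(b0, 1e-12, 1.0))

def pwlc_approx(b, anchor_beliefs):
    """Max over tangents: a PWLC lower approximation of rho at b."""
    return max(float(np.dot(tangent(b0), b)) for b0 in anchor_beliefs)
```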

AAMAS Conference 2010 Conference Paper

Influence of Different Execution Models on Patrolling Ants Behaviors: from Agents to Robots

  • Arnaud Glad
  • Olivier Simonin
  • Olivier Buffet
  • François Charpillet

Generally, swarm models and algorithms consider synchronous agents, i.e., agents that act simultaneously. This hypothesis fits neither multi-agent simulators nor robotic systems. In this paper, we consider such issues on a patrolling ant algorithm. We examine how different execution hypotheses influence the self-organization capabilities and patrolling performance of this algorithm. We consider the mono- and multi-agent cases, the synchronism and determinism hypotheses, and the execution of the model with real robots.

AIJ Journal 2009 Journal Article

The factored policy-gradient planner

  • Olivier Buffet
  • Douglas Aberdeen

We present an any-time concurrent probabilistic temporal planner (CPTP) that includes continuous and discrete uncertainties and metric functions. Rather than relying on dynamic programming, our approach builds on methods from stochastic local policy search. That is, we optimise a parameterised policy using gradient ascent. The flexibility of this policy-gradient approach, combined with its low memory use, the use of function approximation methods and the factorisation of the policy, allows us to tackle complex domains. This factored policy gradient (FPG) planner can optimise steps to goal, the probability of success, or attempt a combination of both. We compare the FPG planner to other planners on CPTP domains, and on simpler but better studied non-concurrent non-temporal probabilistic planning (PP) domains. We present FPG-ipc, the PP version of the planner, which was successful in the probabilistic track of the fifth international planning competition.
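
The underlying policy-gradient principle can be sketched compactly: a REINFORCE-style ascent step for a linear-softmax policy, with `run_episode` and `features` as hypothetical problem hooks. The planner's per-action function approximators and factorisation are reduced here to a single linear layer.

```python
# Hedged sketch of stochastic local policy search by gradient ascent
# (REINFORCE for a linear-softmax policy); not the FPG planner itself.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def reinforce_step(theta, features, run_episode, lr=0.01):
    """theta: [n_actions, n_features]; run_episode(policy) -> list of
    (state, action, return-to-go) triples from one sampled episode."""
    def policy(s):
        return softmax(theta @ features(s))
    grad = np.zeros_like(theta)
    for s, a, G in run_episode(policy):
        p = policy(s)
        onehot = np.zeros(len(p))
        onehot[a] = 1.0
        grad += G * np.outer(onehot - p, features(s))  # G * d log pi / d theta
    return theta + lr * grad
```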

ECAI Conference 2008 Conference Paper

Theoretical Study of Ant-based Algorithms for Multi-Agent Patrolling

  • Arnaud Glad
  • Olivier Simonin 0001
  • Olivier Buffet
  • François Charpillet

This paper addresses the multi-agent patrolling problem, which requires a set of autonomous agents to visit all the places of an unknown environment as regularly as possible. The proposed approach is based on the ant paradigm: each agent can only mark and move according to its local perception of the environment. We study EVAW, a pheromone-based variant of the EVAP [3] and VAW [12] algorithms. The main novelty of the paper is the proof of some emergent spatial properties of the proposed algorithm. In particular, we show that the cycles obtained are necessarily of the same length, which ensures an efficient spatial distribution of the agents. We also report some experimental results and discuss open questions concerning the proposed algorithm.
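
The pheromone mechanism itself fits in a few lines. A toy sketch, with hypothetical grid and neighbourhood hooks: each cell stores its last visit time, and an agent greedily moves to the least recently visited neighbour and stamps it.

```python
# Toy sketch of an EVAW-style patrolling step; `neighbours(pos)` is an
# assumed hook returning adjacent cells, `marks` maps cells to last visit
# times (unvisited cells default to -1 and are preferred).
def evaw_step(pos, marks, neighbours, clock):
    """Move to the least recently visited neighbour and stamp it."""
    nxt = min(neighbours(pos), key=lambda c: marks.get(c, -1))
    marks[nxt] = clock
    return nxt
```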

IJCAI Conference 2007 Conference Paper

Factored Planning Using Decomposition Trees

  • Elena Kelareva
  • Olivier Buffet
  • Jinbo Huang
  • Sylvie Thiébaux

Improving AI planning algorithms relies on the ability to exploit the structure of the problem at hand. A promising direction is that of factored planning, where the domain is partitioned into subdomains with as little interaction as possible. Recent work in this field has led to a detailed theoretical analysis of such approaches and to a couple of high-level planning algorithms, but with no practical implementation or only limited experimentation. This paper presents dTreePlan, a new generic factored planning algorithm which uses a decomposition tree to efficiently partition the domain. We discuss some of its aspects, progressively describing a specific implementation before presenting experimental results. This prototype algorithm is a promising contribution, with major possible improvements, and helps enrich the picture of factored planning approaches.

ICAPS Conference 2007 Conference Paper

Concurrent Probabilistic Temporal Planning with Policy-Gradients

  • Douglas Aberdeen
  • Olivier Buffet

We present an any-time concurrent probabilistic temporal planner that includes continuous and discrete uncertainties and metric functions. Our approach is a direct policy search that attempts to optimise a parameterised policy using gradient ascent. Low memory use, plus the use of function approximation methods, plus factorisation of the policy, allow us to scale to challenging domains. This Factored Policy Gradient (FPG) Planner also attempts to optimise both steps to goal and the probability of success. We compare the FPG planner to other planners on CPTP domains, and on simpler but better studied probabilistic non-temporal domains.

ICAPS Conference 2007 Conference Paper

FF + FPG: Guiding a Policy-Gradient Planner

  • Olivier Buffet
  • Douglas Aberdeen

The Factored Policy-Gradient planner (FPG) was a successful competitor in the probabilistic track of the 2006 International Planning Competition (IPC). FPG is innovative because it scales to large planning domains through the use of Reinforcement Learning. It essentially performs a stochastic local search in policy space. FPG's weakness is potentially long learning times, as it initially acts randomly and progressively improves its policy each time the goal is reached. This paper shows how to use an external teacher to guide FPG's exploration. While any teacher can be used, we concentrate on the actions suggested by FF's heuristic as FF-replan has proved efficient for probabilistic re-planning. To achieve this, FPG must learn its own policy while following another. We thus extend FPG to off-policy learning using importance sampling. The resulting algorithm is presented and evaluated on IPC benchmarks.

JAAMAS Journal 2007 Journal Article

Shaping multi-agent systems with gradient reinforcement learning

  • Olivier Buffet
  • Alain Dutech
  • François Charpillet

An original reinforcement learning (RL) methodology is proposed for the design of multi-agent systems. In the realistic setting of situated agents with local perception, the task of automatically building a coordinated system is of crucial importance. To that end, we design simple reactive agents in a decentralized way as independent learners. But to cope with the difficulties inherent to RL used in that framework, we have developed an incremental learning algorithm where agents face a sequence of progressively more complex tasks. We illustrate this general framework by computer experiments where agents have to coordinate to reach a global goal.

IJCAI Conference 2005 Conference Paper

Robust Planning with (L)RTDP

  • Olivier Buffet
  • Douglas Aberdeen

Stochastic Shortest Path problems (SSPs), a subclass of Markov Decision Problems (MDPs), can be efficiently dealt with using Real-Time Dynamic Programming (RTDP). Yet, MDP models are often uncertain (obtained through statistics or guessing). The usual approach is robust planning: searching for the best policy under the worst model. This paper shows how RTDP can be made robust in the common case where transition probabilities are known to lie in a given interval.