Arrow Research search

Author name cluster

Pascal Poupart

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

93 papers
2 author rows

Possible papers (93)

TMLR Journal 2026 Journal Article

A Practical Algorithm for Feature-Rich, Non-Stationary Bandit Problems

  • William Loh
  • Sajib Kumer Sinha
  • Ankur Agarwal
  • Pascal Poupart

Contextual bandits are incredibly useful in many practical problems. We go one step further by devising a more realistic problem that combines: (1) contextual bandits with dense arm features, (2) non-linear reward functions, and (3) a generalization of correlated bandits where reward distributions change over time but the degree of correlation is maintained. This formulation lends itself to a wider set of applications such as recommendation tasks. To solve this problem, we introduce *conditionally coupled contextual* ($C_3$) Thompson sampling for Bernoulli bandits. It combines an improved Nadaraya-Watson estimator on an embedding space with Thompson sampling, allowing online learning without retraining. Empirical results show that $C_3$ outperforms the next best algorithm, achieving 5.7% lower average cumulative regret on four OpenML tabular datasets and a 12.4% click lift on the Microsoft News Dataset (MIND) compared to other algorithms.
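As an illustration of the Thompson sampling backbone this abstract builds on, here is a minimal Beta-Bernoulli Thompson sampling loop. This is a generic textbook sketch, not the paper's $C_3$ method (it omits the Nadaraya-Watson embedding estimator entirely); the arm probabilities and pull count are made up.

```python
import random

def thompson_step(alpha, beta_, rng):
    """Pick the arm whose sampled Beta posterior draw is highest."""
    samples = [rng.betavariate(a, b) for a, b in zip(alpha, beta_)]
    return max(range(len(samples)), key=samples.__getitem__)

def update(alpha, beta_, arm, reward):
    """Conjugate Beta update for a Bernoulli reward in {0, 1}."""
    alpha[arm] += reward
    beta_[arm] += 1 - reward

# Two-armed toy run: arm 1 pays off more often than arm 0.
rng = random.Random(0)
alpha, beta_ = [1.0, 1.0], [1.0, 1.0]
probs = [0.3, 0.7]
for _ in range(500):
    arm = thompson_step(alpha, beta_, rng)
    reward = 1 if rng.random() < probs[arm] else 0
    update(alpha, beta_, arm, reward)

# Posterior means concentrate near the true payoff probabilities.
means = [a / (a + b) for a, b in zip(alpha, beta_)]
print(means)
```

Because the posterior is updated one observation at a time, the agent learns online with no retraining step, which is the property the paper's sampler exploits.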

TMLR Journal 2026 Journal Article

FedLog: Personalized Federated Classification with Less Communication and More Flexibility

  • Haolin Yu
  • Guojun Zhang
  • Hongliang Li
  • Pascal Poupart

Federated representation learning (FRL) aims to learn personalized federated models with effective feature extraction from local data. FRL algorithms that share the majority of the model parameters face significant challenges with huge communication overhead. This overhead stems from the millions of neural network parameters and slow aggregation progress of the averaging heuristic. To reduce the overhead, we propose FedLog, which shares sufficient data summaries instead of raw model parameters. The data summaries encode minimal sufficient statistics of an exponential family, and Bayesian inference is utilized for global aggregation. FedLog helps reduce message sizes and communication frequency. We prove that the shared messages are minimal sufficient statistics and theoretically analyze the convergence rate of FedLog. To further ensure formal privacy guarantees, we extend FedLog with the differential privacy framework. Empirical results demonstrate that our method achieves high learning accuracy with low communication overhead.

TMLR Journal 2025 Journal Article

A Comprehensive Survey on Inverse Constrained Reinforcement Learning: Definitions, Progress and Challenges

  • Guiliang Liu
  • Sheng Xu
  • Shicheng Liu
  • Ashish Gaurav
  • Sriram Ganapathi Subramanian
  • Pascal Poupart

Inverse Constrained Reinforcement Learning (ICRL) is the task of inferring the implicit constraints that expert agents adhere to, based on their demonstration data. As an emerging research topic, ICRL has received considerable attention in recent years. This article presents a categorical survey of the latest advances in ICRL. It serves as a comprehensive reference for machine learning researchers and practitioners, as well as newcomers seeking to understand the definitions, advancements, and important challenges in ICRL. We begin by formally defining the problem and outlining the algorithmic framework that facilitates constraint inference across various scenarios. These include deterministic or stochastic environments, environments with limited demonstrations, and multiple agents. For each context, we illustrate the critical challenges and introduce a series of fundamental methods to tackle these issues. This survey encompasses discrete, virtual, and realistic environments for evaluating ICRL agents. We also delve into the most pertinent applications of ICRL, such as autonomous driving, robot control, and sports analytics. To stimulate continuing research, we conclude the survey with a discussion of key unresolved questions in ICRL that can effectively foster a bridge between theoretical understanding and practical industrial applications. The papers referenced in this survey can be found at https://github.com/Jasonxu1225/Awesome-Constraint-Inference-in-RL.

ICML Conference 2025 Conference Paper

Reflect-then-Plan: Offline Model-Based Planning through a Doubly Bayesian Lens

  • Jihwan Jeong
  • Xiaoyu Wang 0018
  • Jingmin Wang
  • Scott Sanner
  • Pascal Poupart

Offline reinforcement learning (RL) is crucial when online exploration is costly or unsafe but often struggles with high epistemic uncertainty due to limited data. Existing methods rely on fixed conservative policies, restricting adaptivity and generalization. To address this, we propose Reflect-then-Plan (RefPlan), a novel doubly Bayesian offline model-based (MB) planning approach. RefPlan unifies uncertainty modeling and MB planning by recasting planning as Bayesian posterior estimation. At deployment, it updates a belief over environment dynamics using real-time observations, incorporating uncertainty into MB planning via marginalization. Empirical results on standard benchmarks show that RefPlan significantly improves the performance of conservative offline RL policies. In particular, RefPlan maintains robust performance under high epistemic uncertainty and limited data, while demonstrating resilience to changing environment dynamics, improving the flexibility, generalizability, and robustness of offline-learned policies.

ICML Conference 2025 Conference Paper

Towards Cost-Effective Reward Guided Text Generation

  • Ahmad Rashid
  • Ruotian Wu
  • Rongqi Fan
  • Hongliang Li
  • Agustinus Kristiadi
  • Pascal Poupart

Reward-guided text generation (RGTG) has emerged as a viable alternative to offline reinforcement learning from human feedback (RLHF). RGTG methods can align baseline language models to human preferences without further training as in standard RLHF methods. However, they rely on a reward model to score each candidate token generated by the language model at inference, incurring significant test-time overhead. Additionally, the reward model is usually only trained to score full sequences, which can lead to sub-optimal choices for partial sequences. In this work, we present a novel reward model architecture that is trained, using a Bradley-Terry loss, to prefer the optimal expansion of a sequence with just a single call to the reward model at each step of the generation process. That is, a score for all possible candidate tokens is generated simultaneously, leading to efficient inference. We theoretically analyze various RGTG reward models and demonstrate that prior techniques prefer sub-optimal sequences compared to our method during inference. Empirically, our reward model leads to significantly faster inference than other RGTG methods. It requires fewer calls to the reward model and performs competitively compared to previous RGTG and offline RLHF methods.
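The efficiency argument above (one reward-model call that scores every candidate next token at once, rather than one call per extended sequence) can be sketched with a toy linear reward head. Everything here is illustrative, not the paper's architecture: the vocabulary, hidden state, head weights, and the `beta` mixing parameter are invented for the example.

```python
# Hypothetical numbers: a 4-token vocabulary and a 3-dim hidden state.
VOCAB = ["the", "cat", "sat", "."]

def score_all_candidates(hidden, head_weights):
    """One 'reward model call': a linear head maps the current hidden
    state to a score for every candidate next token simultaneously,
    instead of re-scoring |V| extended sequences one by one."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in head_weights]

def guided_pick(lm_logits, reward_scores, beta=1.0):
    """Combine language-model logits with reward scores, pick the argmax."""
    combined = [l + beta * r for l, r in zip(lm_logits, reward_scores)]
    return max(range(len(combined)), key=combined.__getitem__)

hidden = [0.2, -0.1, 0.5]
head = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]]
scores = score_all_candidates(hidden, head)
tok = guided_pick([0.1, 0.4, 0.0, 0.2], scores, beta=2.0)
print(VOCAB[tok])
```

The point of the sketch is the call pattern: scoring cost per generation step is one head evaluation over the vocabulary, not one full reward-model forward pass per candidate.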

ICLR Conference 2025 Conference Paper

Understanding Constraint Inference in Safety-Critical Inverse Reinforcement Learning

  • Bo Yue
  • Shufan Wang
  • Ashish Gaurav
  • Jian Li
  • Pascal Poupart
  • Guiliang Liu

In practical applications, the underlying constraint knowledge is often unknown and difficult to specify. To address this issue, recent advances in Inverse Constrained Reinforcement Learning (ICRL) have focused on inferring these constraints from expert demonstrations. However, the ICRL approach typically characterizes constraint learning as a tri-level optimization problem, which is inherently complex due to its interdependent variables and multiple layers of optimization. Considering these challenges, a critical question arises: *Can we implicitly embed constraint signals into reward functions and effectively solve this problem using a classic reward inference algorithm?* The resulting method, known as Inverse Reward Correction (IRC), merits investigation. In this work, we conduct a theoretical analysis comparing the sample complexities of both solvers. Our findings confirm that the IRC solver achieves lower sample complexity than its ICRL counterpart. Nevertheless, this reduction in complexity comes at the expense of generalizability. Specifically, in the target environment, the reward correction terms may fail to guarantee the safety of the resulting policy, whereas this issue can be effectively mitigated by transferring the constraints via the ICRL solver. Advancing our inquiry, we investigate conditions under which the ICRL solver ensures $\epsilon$-optimality when transferring to new environments. Empirical results across various environments validate our theoretical findings, underscoring the nuanced trade-offs between complexity reduction and generalizability in safety-critical applications.

TMLR Journal 2025 Journal Article

When Should Reinforcement Learning Use Causal Reasoning?

  • Oliver Schulte
  • Pascal Poupart

Reinforcement learning (RL) and causal reasoning naturally complement each other. The goal of causal reasoning is to predict the effects of interventions in an environment, while the goal of reinforcement learning is to select interventions that maximize the rewards the agent receives from the environment. Reinforcement learning includes the two most powerful sources of information for estimating causal relationships: temporal ordering and the ability to act on an environment. This paper provides a theoretical study examining which reinforcement learning settings we can expect to benefit from causal reasoning, and how. According to our analysis, the key factor is *whether the behavioral policy, which generates the data, can be executed by the learning agent*, meaning that the observation signal available to the learning agent comprises all observations used by the behavioral policy. Common RL settings with behavioral policies that are executable by the learning agent include on-policy learning and online exploration, where the learning agent uses a behavioral policy to explore the environment. Common RL settings with behavioral policies that are not executable by the learning agent include offline learning with a partially observable state space and asymmetric imitation learning where the demonstrator has access to more observations than the imitator. Using the theory of causal graphs, we show formally that when the behavioral policy is executable by the learning agent, conditional probabilities are causal, and can therefore be used to estimate expected rewards as done in traditional RL. However, when the behavioral policy is not executable by the learning agent, conditional probabilities may be confounded and provide misleading estimates of expected rewards. For confounded settings, we describe previous and new methods for leveraging causal reasoning.

RLJ Journal 2024 Journal Article

A Simple Mixture Policy Parameterization for Improving Sample Efficiency of CVaR Optimization

  • Yudong Luo
  • Yangchen Pan
  • Han Wang
  • Philip Torr
  • Pascal Poupart

Reinforcement learning algorithms utilizing policy gradients (PG) to optimize Conditional Value at Risk (CVaR) face significant challenges with sample inefficiency, hindering their practical applications. This inefficiency stems from two main factors: a focus on tail-end performance that overlooks many sampled trajectories, and the potential for vanishing gradients when the lower tail of the return distribution is overly flat. To address these challenges, we propose a simple mixture policy parameterization. This method integrates a risk-neutral policy with an adjustable policy to form a risk-averse policy. By employing this strategy, all collected trajectories can be utilized for policy updating, and the issue of vanishing gradients is counteracted by stimulating higher returns through the risk-neutral component, thus lifting the tail and preventing flatness. Our empirical study reveals that this mixture parameterization is uniquely effective across a variety of benchmark domains. Specifically, it excels in identifying risk-averse CVaR policies in some MuJoCo environments where the traditional CVaR-PG fails to learn a reasonable policy.
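A minimal sketch of the two ingredients discussed above: an empirical CVaR estimate (the mean of the worst alpha-fraction of sampled returns) and a mixture policy that draws each action from either a risk-neutral or an adjustable component. The fixed mixing probability and the toy policies are simplifications invented for the example, not the paper's parameterization.

```python
import random

def empirical_cvar(returns, alpha=0.1):
    """Mean of the worst alpha-fraction of sampled returns."""
    k = max(1, int(len(returns) * alpha))
    tail = sorted(returns)[:k]
    return sum(tail) / len(tail)

def mixture_action(risk_neutral, adjustable, weight, state, rng):
    """With probability `weight` follow the risk-neutral component,
    otherwise the adjustable (risk-shaping) component."""
    policy = risk_neutral if rng.random() < weight else adjustable
    return policy(state)

rng = random.Random(1)
returns = [1.0, 2.0, -5.0, 0.5, 3.0, -4.0, 2.5, 1.5, 0.0, 2.0]
print(empirical_cvar(returns, alpha=0.2))  # mean of the two worst returns

# Toy policies returning labels instead of real actions.
action = mixture_action(lambda s: "fast", lambda s: "safe", 0.5, None, rng)
```

Because every trajectory contributes to at least one component's gradient, none of the sampled data is discarded, which is the sample-efficiency point the abstract makes.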

RLC Conference 2024 Conference Paper

A Simple Mixture Policy Parameterization for Improving Sample Efficiency of CVaR Optimization

  • Yudong Luo
  • Yangchen Pan
  • Han Wang
  • Philip Torr
  • Pascal Poupart

Reinforcement learning algorithms utilizing policy gradients (PG) to optimize Conditional Value at Risk (CVaR) face significant challenges with sample inefficiency, hindering their practical applications. This inefficiency stems from two main factors: a focus on tail-end performance that overlooks many sampled trajectories, and the potential for vanishing gradients when the lower tail of the return distribution is overly flat. To address these challenges, we propose a simple mixture policy parameterization. This method integrates a risk-neutral policy with an adjustable policy to form a risk-averse policy. By employing this strategy, all collected trajectories can be utilized for policy updating, and the issue of vanishing gradients is counteracted by stimulating higher returns through the risk-neutral component, thus lifting the tail and preventing flatness. Our empirical study reveals that this mixture parameterization is uniquely effective across a variety of benchmark domains. Specifically, it excels in identifying risk-averse CVaR policies in some MuJoCo environments where the traditional CVaR-PG fails to learn a reasonable policy.

ICML Conference 2024 Conference Paper

A Sober Look at LLMs for Material Discovery: Are They Actually Good for Bayesian Optimization Over Molecules?

  • Agustinus Kristiadi
  • Felix Strieth-Kalthoff
  • Marta Skreta
  • Pascal Poupart
  • Alán Aspuru-Guzik
  • Geoff Pleiss

Automation is one of the cornerstones of contemporary material discovery. Bayesian optimization (BO) is an essential part of such workflows, enabling scientists to leverage prior domain knowledge into efficient exploration of a large molecular space. While such prior knowledge can take many forms, there has been significant fanfare around the ancillary scientific knowledge encapsulated in large language models (LLMs). However, existing work thus far has only explored LLMs for heuristic materials searches. Indeed, recent work obtains the uncertainty estimate—an integral part of BO—from point-estimated, non-Bayesian LLMs. In this work, we study the question of whether LLMs are actually useful to accelerate principled Bayesian optimization in the molecular space. We take a sober, dispassionate stance in answering this question. This is done by carefully (i) viewing LLMs as fixed feature extractors for standard but principled BO surrogate models and by (ii) leveraging parameter-efficient finetuning methods and Bayesian neural networks to obtain the posterior of the LLM surrogate. Our extensive experiments with real-world chemistry problems show that LLMs can be useful for BO over molecules, but only if they have been pretrained or finetuned with domain-specific data.
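To make the "LLM as fixed feature extractor for a BO surrogate" recipe concrete, here is a deliberately crude acquisition step over precomputed embedding vectors. The distance-weighted prediction and nearest-neighbor uncertainty proxy stand in for the principled surrogates (e.g. Gaussian processes or Bayesian last layers) the paper studies; the 2-d "embeddings" and observed scores are invented.

```python
import math

def ucb_next(candidates, observed_x, observed_y, kappa=1.0):
    """Toy BO acquisition over fixed feature vectors: predict each
    candidate's value as a distance-weighted mean of observed scores,
    with uncertainty growing with distance to the nearest observation."""
    def dist(a, b):
        return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))
    best, best_score = None, -math.inf
    for i, x in enumerate(candidates):
        ds = [dist(x, xo) for xo in observed_x]
        w = [1.0 / (d + 1e-8) for d in ds]
        mean = sum(wi * yi for wi, yi in zip(w, observed_y)) / sum(w)
        sigma = min(ds)  # crude uncertainty proxy
        score = mean + kappa * sigma
        if score > best_score:
            best, best_score = i, score
    return best

# Hypothetical 2-d "LLM embeddings" of molecules and their measured scores.
observed_x = [[0.0, 0.0], [1.0, 1.0]]
observed_y = [0.2, 0.8]
candidates = [[0.1, 0.0], [0.9, 1.0], [3.0, 3.0]]
print(ucb_next(candidates, observed_x, observed_y, kappa=0.5))
```

With `kappa > 0` the far-away (high-uncertainty) candidate can win, and with `kappa = 0` the rule exploits the highest predicted mean, mirroring the explore/exploit trade-off a real BO surrogate manages.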

AAAI Conference 2024 Conference Paper

Calibrated One Round Federated Learning with Bayesian Inference in the Predictive Space

  • Mohsin Hasan
  • Guojun Zhang
  • Kaiyang Guo
  • Xi Chen
  • Pascal Poupart

Federated Learning (FL) involves training a model over a dataset distributed among clients, with the constraint that each client’s dataset is localized and possibly heterogeneous. In FL, small and noisy datasets are common, highlighting the need for well-calibrated models that represent the uncertainty of predictions. The closest FL techniques to achieving such goals are the Bayesian FL methods which collect parameter samples from local posteriors, and aggregate them to approximate the global posterior. To improve scalability for larger models, one common Bayesian approach is to approximate the global predictive posterior by multiplying local predictive posteriors. In this work, we demonstrate that this method gives systematically overconfident predictions, and we remedy this by proposing β-Predictive Bayes, a Bayesian FL algorithm that interpolates between a mixture and product of the predictive posteriors, using a tunable parameter β. This parameter is tuned to improve the global ensemble’s calibration, before it is distilled to a single model. Our method is evaluated on a variety of regression and classification datasets to demonstrate its superiority in calibration to other baselines, even as data heterogeneity increases. Code available at https://github.com/hasanmohsin/betaPredBayesFL. Our paper's full version is at https://arxiv.org/abs/2312.09817.
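The mixture-versus-product tension described above can be seen numerically on discrete predictive distributions. The geometric interpolation rule below is one plausible way to slide between the two with a parameter beta, not necessarily the paper's exact formula; the two-client predictions are invented.

```python
def normalize(p):
    s = sum(p)
    return [x / s for x in p]

def mixture(preds):
    """Arithmetic mean of client predictive distributions."""
    n = len(preds)
    return [sum(p[c] for p in preds) / n for c in range(len(preds[0]))]

def product(preds):
    """Normalized elementwise product of client predictives."""
    out = [1.0] * len(preds[0])
    for p in preds:
        out = [o * x for o, x in zip(out, p)]
    return normalize(out)

def beta_interpolate(preds, beta):
    """Geometric interpolation: beta=0 gives the mixture, beta=1 the
    product. A sketch of the interpolation idea, not the paper's rule."""
    m, pr = mixture(preds), product(preds)
    return normalize([mi ** (1 - beta) * pi ** beta for mi, pi in zip(m, pr)])

# Two clients both leaning toward class 0: the product sharpens the
# consensus (overconfidence), the mixture stays softer.
preds = [[0.7, 0.3], [0.8, 0.2]]
print(mixture(preds)[0], product(preds)[0], beta_interpolate(preds, 0.5)[0])
```

Multiplying the two moderately confident predictions yields a much sharper distribution than either client alone, which illustrates why a tunable interpolation toward the mixture can restore calibration.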

ICML Conference 2024 Conference Paper

Confidence Aware Inverse Constrained Reinforcement Learning

  • Sriram Ganapathi Subramanian
  • Guiliang Liu
  • Mohammed Elmahgiubi
  • Kasra Rezaee
  • Pascal Poupart

In coming up with solutions to real-world problems, humans implicitly adhere to constraints that are too numerous and complex to be specified completely. However, reinforcement learning (RL) agents need these constraints to learn the correct optimal policy in these settings. The field of Inverse Constraint Reinforcement Learning (ICRL) deals with this problem and provides algorithms that aim to estimate the constraints from expert demonstrations collected offline. Practitioners prefer to know a measure of confidence in the estimated constraints before deciding to use them, so that they can rely only on constraints that satisfy a desired level of confidence. However, prior works do not allow users to provide the desired level of confidence for the inferred constraints. This work provides a principled ICRL method that takes a confidence level together with a set of expert demonstrations and outputs a constraint that is at least as constraining as the true underlying constraint with the desired level of confidence. Further, unlike previous methods, this method allows a user to know if the number of expert trajectories is insufficient to learn a constraint with a desired level of confidence, and therefore to collect more expert trajectories as required to simultaneously learn constraints with the desired level of confidence and a policy that achieves the desired level of performance.

JMLR Journal 2024 Journal Article

Label Alignment Regularization for Distribution Shift

  • Ehsan Imani
  • Guojun Zhang
  • Runjia Li
  • Jun Luo
  • Pascal Poupart
  • Philip H.S. Torr
  • Yangchen Pan

Recent work has highlighted the label alignment property (LAP) in supervised learning, where the vector of all labels in the dataset is mostly in the span of the top few singular vectors of the data matrix. Drawing inspiration from this observation, we propose a regularization method for unsupervised domain adaptation that encourages alignment between the predictions in the target domain and its top singular vectors. Unlike conventional domain adaptation approaches that focus on regularizing representations, we instead regularize the classifier to align with the unsupervised target data, guided by the LAP in both the source and target domains. Theoretical analysis demonstrates that, under certain assumptions, our solution resides within the span of the top right singular vectors of the target domain data and aligns with the optimal solution. By removing the reliance on the commonly used optimal joint risk assumption found in classic domain adaptation theory, we showcase the effectiveness of our method on addressing problems where traditional domain adaptation methods often fall short due to high joint error. Additionally, we report improved performance over domain adaptation baselines in well-known tasks such as MNIST-USPS domain adaptation and cross-lingual sentiment analysis. An implementation is available at https://github.com/EhsanEI/lar/.

NeurIPS Conference 2024 Conference Paper

Subject-driven Text-to-Image Generation via Preference-based Reinforcement Learning

  • Yanting Miao
  • William Loh
  • Suraj Kothawade
  • Pascal Poupart
  • Abdullah Rashwan
  • Yeqing Li

Text-to-image generative models have recently attracted considerable interest, enabling the synthesis of high-quality images from textual prompts. However, these models often lack the capability to generate specific subjects from given reference images or to synthesize novel renditions under varying conditions. Methods like DreamBooth and Subject-driven Text-to-Image (SuTI) have made significant progress in this area. Yet, both approaches primarily focus on enhancing similarity to reference images and require expensive setups, often overlooking the need for efficient training and avoiding overfitting to the reference images. In this work, we present the $\lambda$-Harmonic reward function, which provides a reliable reward signal and enables early stopping for faster training and effective regularization. Combined with the Bradley-Terry preference model, the $\lambda$-Harmonic reward function also provides preference labels for subject-driven generation tasks. We propose Reward Preference Optimization (RPO), which offers a simpler setup (requiring only 3% of the negative samples used by DreamBooth) and fewer gradient steps for fine-tuning. Unlike most existing methods, our approach does not require training a text encoder or optimizing text embeddings and achieves text-image alignment by fine-tuning only the U-Net component. Empirically, $\lambda$-Harmonic proves to be a reliable approach for model selection in subject-driven generation tasks. Based on preference labels and early stopping validation from the $\lambda$-Harmonic reward function, our algorithm achieves a state-of-the-art CLIP-I score of 0.833 and a CLIP-T score of 0.314 on DreamBench.

NeurIPS Conference 2023 Conference Paper

An Alternative to Variance: Gini Deviation for Risk-averse Policy Gradient

  • Yudong Luo
  • Guiliang Liu
  • Pascal Poupart
  • Yangchen Pan

Restricting the variance of a policy’s return is a popular choice in risk-averse Reinforcement Learning (RL) due to its clear mathematical definition and easy interpretability. Traditional methods directly restrict the total return variance. Recent methods restrict the per-step reward variance as a proxy. We thoroughly examine the limitations of these variance-based methods, such as sensitivity to numerical scale and hindering of policy learning, and propose to use an alternative risk measure, Gini deviation, as a substitute. We study various properties of this new risk measure and derive a policy gradient algorithm to minimize it. Empirical evaluation in domains where risk-aversion can be clearly defined shows that our algorithm can mitigate the limitations of variance-based risk measures and achieves high return with low risk in terms of variance and Gini deviation when others fail to learn a reasonable policy.
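Gini deviation has a simple empirical estimator: half the mean absolute difference over all pairs of sampled returns. A direct sketch (the sample return sets are invented for illustration):

```python
def gini_deviation(returns):
    """Empirical Gini deviation: half the mean absolute difference
    between all ordered pairs of sampled returns (O(n^2) version)."""
    n = len(returns)
    total = sum(abs(x - y) for x in returns for y in returns)
    return total / (2 * n * n)

# Same mean (1.0), different dispersion: the riskier set scores higher.
safe = [1.0, 1.0, 1.0, 1.0]
risky = [4.0, 0.0, 0.0, 0.0]
print(gini_deviation(safe), gini_deviation(risky))
```

Unlike variance, this measure scales linearly with the returns (doubling all returns doubles the Gini deviation rather than quadrupling it), which relates to the numerical-scale sensitivity the abstract criticizes in variance-based methods.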

NeurIPS Conference 2023 Conference Paper

Batchnorm Allows Unsupervised Radial Attacks

  • Amur Ghose
  • Apurv Gupta
  • Yaoliang Yu
  • Pascal Poupart

The construction of adversarial examples usually requires the existence of soft or hard labels for each instance, with respect to which a loss gradient provides the signal for construction of the example. We show that for batch normalized deep image recognition architectures, intermediate latents that are produced after a batch normalization step by themselves suffice to produce adversarial examples using an intermediate loss solely utilizing angular deviations, without relying on any label. We motivate our loss through the geometry of batch-normed representations, their concentration of norm on a hypersphere, and their distributional proximity to Gaussians. Our losses expand intermediate latent based attacks that usually require labels. The success of our method implies that leakage of intermediate representations may create a security breach for deployed models, which persists even when the model is transferred to downstream usage. Removal of batch norm weakens our attack, indicating it contributes to this vulnerability. Our attacks also succeed against LayerNorm empirically, thus being relevant for transformer architectures, most notably vision transformers which we analyze.

ICLR Conference 2023 Conference Paper

Benchmarking Constraint Inference in Inverse Reinforcement Learning

  • Guiliang Liu
  • Yudong Luo
  • Ashish Gaurav
  • Kasra Rezaee
  • Pascal Poupart

When deploying Reinforcement Learning (RL) agents into a physical system, we must ensure that these agents are well aware of the underlying constraints. In many real-world problems, however, the constraints are often hard to specify mathematically and unknown to the RL agents. To tackle these issues, Inverse Constrained Reinforcement Learning (ICRL) empirically estimates constraints from expert demonstrations. As an emerging research topic, ICRL does not have common benchmarks, and previous works tested algorithms under hand-crafted environments with manually-generated expert demonstrations. In this paper, we construct an ICRL benchmark in the context of RL application domains, including robot control and autonomous driving. For each environment, we design relevant constraints and train expert agents to generate demonstration data. In addition, unlike existing baselines that learn a deterministic constraint, we propose a variational ICRL method to model a posterior distribution of candidate constraints. We conduct extensive experiments on these algorithms under our benchmark and show how they can facilitate studying important research challenges for ICRL. The benchmark, including the instructions for reproducing ICRL algorithms, is available at https://github.com/Guiliang/ICRL-benchmarks-public.

AAMAS Conference 2023 Conference Paper

Defensive Collaborative Learning: Protecting Objective Privacy in Data Sharing

  • Cynthia Huang
  • Pascal Poupart

In collaborative machine learning, protecting proprietary information and safeguarding competitive advantages are crucial for participating organizations. This necessitates the development of algorithms that target a general notion of privacy defined by the data owner: objective privacy. In this paper, we formalize the idea of objective privacy as the protection of private value propositions characterized by predictive functions. We propose Defensive Collaborative Learning (DCL), where participants share data collaboratively while safeguarding their objective privacy. Formulating a min-max optimization problem that trades off utility and privacy protection, we propose algorithms that leverage mutual information backpropagation in both decentralized and centralized settings. Empirical studies show that the proposed algorithms protect objective privacy while enabling data sharing.

AAMAS Conference 2023 Conference Paper

FedFormer: Contextual Federation with Attention in Reinforcement Learning

  • Liam Hebert
  • Lukasz Golab
  • Pascal Poupart
  • Robin Cohen

A core issue in multi-agent federated reinforcement learning is defining how to aggregate insights from multiple agents. This is commonly done by taking the average of each participating agent’s model weights into one common model (FedAvg). We instead propose FedFormer, a novel federation strategy that utilizes Transformer Attention to contextually aggregate embeddings from models originating from different learner agents. In so doing, we attentively weigh the contributions of other agents with respect to the current agent’s environment and learned relationships, thus providing a more effective and efficient federation. We evaluate our methods on the Meta-World environment and find that our approach yields significant improvements over FedAvg and non-federated single-agent Soft Actor-Critic methods. Our results compared to Soft Actor-Critic show that FedFormer achieves higher episodic return while still abiding by the privacy constraints of federated learning. Finally, we also demonstrate improvements in effectiveness with increased agent pools across all methods in certain tasks. This is contrasted by FedAvg, which fails to make noticeable improvements when scaled.

ICLR Conference 2023 Conference Paper

Learning Soft Constraints From Constrained Expert Demonstrations

  • Ashish Gaurav
  • Kasra Rezaee
  • Guiliang Liu
  • Pascal Poupart

Inverse reinforcement learning (IRL) methods assume that the expert data is generated by an agent optimizing some reward function. However, in many settings, the agent may optimize a reward function subject to some constraints, where the constraints induce behaviors that may be otherwise difficult to express with just a reward function. We consider the setting where the reward function is given and the constraints are unknown, and propose a method that is able to recover these constraints satisfactorily from the expert data. While previous work has focused on recovering hard constraints, our method can recover cumulative soft constraints that the agent satisfies on average per episode. In IRL fashion, our method solves this problem by adjusting the constraint function iteratively through a constrained optimization procedure, until the agent behavior matches the expert behavior. We demonstrate our approach on synthetic environments, robotics environments and real world highway driving scenarios.

NeurIPS Conference 2023 Conference Paper

Multi-Modal Inverse Constrained Reinforcement Learning from a Mixture of Demonstrations

  • Guanren Qiao
  • Guiliang Liu
  • Pascal Poupart
  • Zhiqiang Xu

Inverse Constraint Reinforcement Learning (ICRL) aims to recover the underlying constraints respected by expert agents in a data-driven manner. Existing ICRL algorithms typically assume that the demonstration data is generated by a single type of expert. However, in practice, demonstrations often comprise a mixture of trajectories collected from various expert agents respecting different constraints, making it challenging to explain expert behaviors with a unified constraint function. To tackle this issue, we propose a Multi-Modal Inverse Constrained Reinforcement Learning (MMICRL) algorithm for simultaneously estimating multiple constraints corresponding to different types of experts. MMICRL constructs a flow-based density estimator that enables unsupervised expert identification from demonstrations, so as to infer the agent-specific constraints. Following these constraints, MMICRL imitates expert policies with a novel multi-modal constrained policy optimization objective that minimizes the agent-conditioned policy entropy and maximizes the unconditioned one. To enhance robustness, we incorporate this objective into the contrastive learning framework. This approach enables imitation policies to capture the diversity of behaviors among expert agents. Extensive experiments in both discrete and continuous environments show that MMICRL outperforms other baselines in terms of constraint recovery and control performance.

AAAI Conference 2022 Conference Paper

Decentralized Mean Field Games

  • Sriram Ganapathi Subramanian
  • Matthew E. Taylor
  • Mark Crowley
  • Pascal Poupart

Multiagent reinforcement learning algorithms have not been widely adopted in large scale environments with many agents as they often scale poorly with the number of agents. Using mean field theory to aggregate agents has been proposed as a solution to this problem. However, almost all previous methods in this area make a strong assumption of a centralized system where all the agents in the environment learn the same policy and are effectively indistinguishable from each other. In this paper, we relax this assumption about indistinguishable agents and propose a new mean field system known as Decentralized Mean Field Games, where each agent can be quite different from others. All agents learn independent policies in a decentralized fashion, based on their local observations. We define a theoretical solution concept for this system and provide a fixed point guarantee for a Q-learning based algorithm in this system. A practical consequence of our approach is that we can address a ‘chicken-and-egg’ problem in empirical mean field reinforcement learning algorithms. Further, we provide Q-learning and actor-critic algorithms that use the decentralized mean field learning approach and achieve stronger performance than common baselines in this area. In our setting, agents do not need to be clones of each other and learn in a fully decentralized fashion. Hence, for the first time, we show the application of mean field learning methods in fully competitive environments, large-scale continuous action space environments, and other environments with heterogeneous agents. Importantly, we also apply the mean field method in a ride-sharing problem using a real-world dataset. We propose a decentralized solution to this problem, which is more practical than existing centralized training methods.

ICLR Conference 2022 Conference Paper

Distributional Reinforcement Learning with Monotonic Splines

  • Yudong Luo
  • Guiliang Liu
  • Haonan Duan 0002
  • Oliver Schulte
  • Pascal Poupart

Distributional Reinforcement Learning (RL) differs from traditional RL by estimating the distribution over returns to capture the intrinsic uncertainty of MDPs. One key challenge in distributional RL lies in how to parameterize the quantile function when minimizing the Wasserstein metric of temporal differences. Existing algorithms use step functions or piecewise linear functions. In this paper, we propose to learn smooth continuous quantile functions represented by monotonic rational-quadratic splines, which also naturally solve the quantile crossing problem. Experiments in stochastic environments show that dense estimation of quantile functions enhances distributional RL in terms of faster empirical convergence and higher rewards in most cases.
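The quantile-crossing fix described above can be illustrated with a parameterization much simpler than the paper's rational-quadratic splines: if knot heights are cumulative sums of strictly positive increments, the learned quantile function is monotone by construction, so quantiles can never cross. A minimal sketch (piecewise-linear rather than spline-based; all names and sizes are illustrative):

```python
import numpy as np

def softplus(x):
    # smooth map from unconstrained reals to strictly positive values
    return np.log1p(np.exp(x))

def monotone_quantile(taus, theta):
    """Map quantile levels taus in (0,1) to return values via a monotone
    piecewise-linear function: knot heights are cumulative sums of strictly
    positive increments, so the function is non-decreasing by construction."""
    knots = np.linspace(0.0, 1.0, len(theta))
    heights = np.cumsum(softplus(theta))  # strictly increasing knot heights
    return np.interp(taus, knots, heights)

rng = np.random.default_rng(0)
theta = rng.normal(size=16)               # unconstrained parameters
taus = np.linspace(0.01, 0.99, 50)
q = monotone_quantile(taus, theta)        # monotone: no quantile crossing
```

Monotonicity here holds for any parameter values, which is the property the spline parameterization in the paper also guarantees.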

UAI Conference 2022 Conference Paper

Learning functions on multiple sets using multi-set transformers

  • Kira A. Selby
  • Ahmad Rashid
  • Ivan Kobyzev
  • Mehdi Rezagholizadeh
  • Pascal Poupart

We propose a general deep architecture for learning functions on multiple permutation-invariant sets. We also show how to generalize this architecture to sets of elements of any dimension by dimension equivariance. We demonstrate that our architecture is a universal approximator of these functions, and show superior results to existing methods on a variety of tasks including counting tasks, alignment tasks, distinguishability tasks and statistical distance measurements. This last task is particularly important in machine learning. Although our approach is quite general, we demonstrate that it can generate approximate estimates of KL divergence and mutual information that are more accurate than previous techniques that are specifically designed to approximate those statistical distances.

ICLR Conference 2022 Conference Paper

Learning Object-Oriented Dynamics for Planning from Text

  • Guiliang Liu
  • Ashutosh Adhikari
  • Amir Massoud Farahmand
  • Pascal Poupart

The advancement of dynamics models enables model-based planning in complex environments. Existing dynamics models commonly study image-based games with fully observable states. Generalizing these models to Text-Based Games (TBGs), which commonly describe partially observable states with noisy text observations, is challenging. In this work, we propose an Object-Oriented Text Dynamics (OOTD) model that enables planning algorithms to solve decision-making problems in text domains. OOTD predicts a memory graph that dynamically remembers the history of object observations and filters object-irrelevant information. To improve the robustness of the dynamics, our OOTD model identifies the objects influenced by input actions and predicts the belief of object states with independently parameterized transition layers. We develop variational objectives under the object-supervised and self-supervised settings to model the stochasticity of predicted dynamics. Empirical results show that the OOTD-based planner significantly outperforms model-free baselines in terms of sample efficiency and running scores.

UAI Conference 2022 Conference Paper

Linearizing contextual bandits with latent state dynamics

  • Elliot Nelson
  • Debarun Bhattacharjya
  • Tian Gao 0007
  • Miao Liu 0001
  • Djallel Bouneffouf 0001
  • Pascal Poupart

In many real-world applications of multi-armed bandit problems, both rewards and contexts are often influenced by confounding latent variables which evolve stochastically over time. While the observed contexts and rewards are nonlinearly related, we show that prior knowledge of latent causal structure can be used to reduce the problem to the linear bandit setting. We develop two algorithms, Latent Linear Thompson Sampling (L2TS) and Latent Linear UCB (L2UCB), which use online EM algorithms for hidden Markov models to learn the latent transition model and maintain a posterior belief over the latent state, and then use the resulting posteriors as context features in a linear bandit problem. We upper bound the error in reward estimation in the presence of a dynamical latent state, and derive a novel problem-dependent regret bound for linear Thompson sampling with non-stationarity and unconstrained reward distributions, which we apply to L2TS under certain conditions. Finally, we demonstrate the superiority of our algorithms over related bandit algorithms through experiments.
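The reduction above ends in a standard linear bandit, so the final stage of both L2TS and L2UCB boils down to ordinary Bayesian linear bandit updates. A hedged sketch of linear Thompson sampling alone (dimensions, noise level, and the use of random vectors in place of the HMM latent-state posteriors are all illustrative assumptions, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_arms = 4, 5
theta_true = rng.normal(size=d)           # unknown reward weights

# Gaussian posterior over weights: N(mu, B^-1), starting from the unit prior
B, f = np.eye(d), np.zeros(d)             # precision matrix and data vector
total_reward = 0.0
for t in range(500):
    contexts = rng.normal(size=(n_arms, d))  # stand-in for posterior features
    mu = np.linalg.solve(B, f)
    theta = rng.multivariate_normal(mu, np.linalg.inv(B))  # posterior sample
    a = int(np.argmax(contexts @ theta))     # Thompson action selection
    r = contexts[a] @ theta_true + 0.1 * rng.normal()
    B += np.outer(contexts[a], contexts[a])  # conjugate Bayesian update
    f += r * contexts[a]
    total_reward += r
```

The point of the paper's construction is that, once the latent-state posterior is used as the context, exactly this kind of linear machinery applies despite the nonlinear observed relationship.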

JMLR Journal 2022 Journal Article

Optimality and Stability in Non-Convex Smooth Games

  • Guojun Zhang
  • Pascal Poupart
  • Yaoliang Yu

Convergence to a saddle point for convex-concave functions has been studied for decades, while recent years have seen a surge of interest in non-convex (zero-sum) smooth games, motivated by their recent wide applications. How local optimal points should be defined, and which algorithms can converge to such points, remain intriguing research challenges. An interesting concept is known as the local minimax point, which strongly correlates with the widely-known gradient descent ascent algorithm. This paper aims to provide a comprehensive analysis of local minimax points, such as their relation with other solution concepts and their optimality conditions. We find that local saddle points can be regarded as a special type of local minimax points, called uniformly local minimax points, under mild continuity assumptions. In (non-convex) quadratic games, we show that local minimax points are (in some sense) equivalent to global minimax points. Finally, we study the stability of gradient algorithms near local minimax points. Although gradient algorithms can converge to local/global minimax points in the non-degenerate case, they would often fail in general cases. This implies the necessity of either novel algorithms or concepts beyond saddle points and minimax points in non-convex smooth games.

NeurIPS Conference 2022 Conference Paper

Uncertainty-Aware Reinforcement Learning for Risk-Sensitive Player Evaluation in Sports Game

  • Guiliang Liu
  • Yudong Luo
  • Oliver Schulte
  • Pascal Poupart

A major task of sports analytics is player evaluation. Previous methods commonly measured the impact of players' actions on desirable outcomes (e.g., goals or winning) without considering the risk induced by stochastic game dynamics. In this paper, we design an uncertainty-aware Reinforcement Learning (RL) framework to learn a risk-sensitive player evaluation metric from stochastic game dynamics. To embed the risk of a player’s movements into the distribution of action-values, we model their 1) aleatoric uncertainty, which represents the intrinsic stochasticity in a sports game, and 2) epistemic uncertainty, which is due to a model's insufficient knowledge regarding Out-of-Distribution (OoD) samples. We demonstrate how a distributional Bellman operator and a feature-space density model can capture these uncertainties. Based on such uncertainty estimation, we propose a Risk-sensitive Game Impact Metric (RiGIM) that measures players' performance over a season by conditioning on a specific confidence level. Empirical evaluation, based on over 9M play-by-play ice hockey and soccer events, shows that RiGIM correlates highly with standard success measures and has a consistent risk sensitivity.

NeurIPS Conference 2021 Conference Paper

Learning Tree Interpretation from Object Representation for Deep Reinforcement Learning

  • Guiliang Liu
  • Xiangyu Sun
  • Oliver Schulte
  • Pascal Poupart

Interpreting Deep Reinforcement Learning (DRL) models is important to enhance trust and comply with transparency regulations. Existing methods typically explain a DRL model by visualizing the importance of low-level input features with super-pixels, attentions, or saliency maps. Our approach provides an interpretation based on high-level latent object features derived from a disentangled representation. We propose a Represent And Mimic (RAMi) framework for training 1) an identifiable latent representation to capture the independent factors of variation for the objects and 2) a mimic tree that extracts the causal impact of the latent features on DRL action values. To jointly optimize both the fidelity and the simplicity of a mimic tree, we derive a novel Minimum Description Length (MDL) objective based on the Information Bottleneck (IB) principle. Based on this objective, we describe a Monte Carlo Regression Tree Search (MCRTS) algorithm that explores different splits to find the IB-optimal mimic tree. Experiments show that our mimic tree achieves strong approximation performance with significantly fewer nodes than baseline models. We demonstrate the interpretability of our mimic tree by showing latent traversals, decision rules, causal impacts, and human evaluation results.

AAMAS Conference 2021 Conference Paper

Partially Observable Mean Field Reinforcement Learning

  • Sriram Ganapathi Subramanian
  • Matthew E. Taylor
  • Mark Crowley
  • Pascal Poupart

Traditional multi-agent reinforcement learning algorithms are not scalable to environments with more than a few agents, since these algorithms are exponential in the number of agents. Recent research has introduced successful methods to scale multi-agent reinforcement learning algorithms to many agent scenarios using mean field theory. Previous work in this field assumes that an agent has access to exact cumulative metrics regarding the mean field behaviour of the system, which it can then use to take its actions. In this paper, we relax this assumption and maintain a distribution to model the uncertainty regarding the mean field of the system. We consider two different settings for this problem. In the first setting, only agents in a fixed neighbourhood are visible, while in the second setting, the visibility of agents is determined at random based on distances. For each of these settings, we introduce a Q-learning based algorithm that can learn effectively. We prove that this Q-learning estimate stays very close to the Nash Q-value (under a common set of assumptions) for the first setting. We also empirically show our algorithms outperform multiple baselines in three different games in the MAgent framework, which supports large environments with many agents learning simultaneously to achieve possibly distinct goals.

NeurIPS Conference 2021 Conference Paper

Quantifying and Improving Transferability in Domain Generalization

  • Guojun Zhang
  • Han Zhao
  • Yaoliang Yu
  • Pascal Poupart

Out-of-distribution generalization is one of the key challenges when transferring a model from the lab to the real world. Existing efforts mostly focus on building invariant features among source and target domains. Based on invariant features, a high-performing classifier on source domains could hopefully behave equally well on a target domain. In other words, we hope the invariant features to be *transferable*. However, in practice, there are no perfectly transferable features, and some algorithms seem to learn "more transferable" features than others. How can we understand and quantify such *transferability*? In this paper, we formally define transferability that one can quantify and compute in domain generalization. We point out the difference and connection with common discrepancy measures between domains, such as total variation and Wasserstein distance. We then prove that our transferability can be estimated with enough samples and give a new upper bound for the target error based on our transferability. Empirically, we evaluate the transferability of the feature embeddings learned by existing algorithms for domain generalization. Surprisingly, we find that many algorithms are not quite learning transferable features, although a few still manage to. In light of this, we propose a new algorithm for learning transferable features and test it over various benchmark datasets, including RotatedMNIST, PACS, Office-Home and WILDS-FMoW. Experimental results show that the proposed algorithm achieves consistent improvement over many state-of-the-art algorithms, corroborating our theoretical findings.

UAI Conference 2020 Conference Paper

Batch norm with entropic regularization turns deterministic autoencoders into generative models

  • Amur Ghose
  • Abdullah Rashwan
  • Pascal Poupart

The variational autoencoder is a well-defined deep generative model that utilizes an encoder-decoder framework where an encoding neural network outputs a non-deterministic code for reconstructing an input. The encoder achieves this by sampling from a distribution for every input, instead of outputting a deterministic code per input. The great advantage of this process is that it allows the use of the network as a generative model for sampling from the data distribution beyond provided samples for training. We show in this work that utilizing batch normalization as a source of non-determinism suffices to turn deterministic autoencoders into generative models on par with variational ones, so long as we add a suitable entropic regularization to the training objective.

AAAI Conference 2020 Conference Paper

Diachronic Embedding for Temporal Knowledge Graph Completion

  • Rishab Goel
  • Seyed Mehran Kazemi
  • Marcus Brubaker
  • Pascal Poupart

Knowledge graphs (KGs) typically contain temporal facts indicating relationships among entities at different times. Due to their incompleteness, several approaches have been proposed to infer new facts for a KG based on the existing ones, a problem known as KG completion. KG embedding approaches have proved effective for KG completion; however, they have been developed mostly for static KGs. Developing temporal KG embedding models is an increasingly important problem. In this paper, we build novel models for temporal KG completion by equipping static models with a diachronic entity embedding function which provides the characteristics of entities at any point in time. This is in contrast to the existing temporal KG embedding approaches where only static entity features are provided. The proposed embedding function is model-agnostic and can potentially be combined with any static model. We prove that combining it with SimplE, a recent model for static KG embedding, results in a fully expressive model for temporal KG completion. Our experiments indicate the superiority of our proposal compared to existing baselines.
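The core idea, an entity embedding that is a function of time, can be sketched concretely. The version below assumes a sine-activated temporal block alongside a static block, which is one instantiation considered for diachronic embeddings; the dimension split and parameter names are illustrative:

```python
import numpy as np

def diachronic_embedding(static, amp, freq, phase, t):
    """Entity embedding at time t: the first block is temporal
    (amp * sin(freq * t + phase), one possible activation choice),
    the second block is ordinary static features shared across time."""
    temporal = amp * np.sin(freq * t + phase)
    return np.concatenate([temporal, static])

d_t, d_s = 3, 3                       # temporal / static feature counts
rng = np.random.default_rng(1)
static = rng.normal(size=d_s)         # learned static features
amp, freq, phase = rng.normal(size=(3, d_t))  # learned temporal parameters

z_2010 = diachronic_embedding(static, amp, freq, phase, 2010.0)
z_2020 = diachronic_embedding(static, amp, freq, phase, 2020.0)
# static block is identical across time; temporal block moves with t
```

Because the function is just a drop-in replacement for a static entity vector, it can feed any static scoring model (SimplE in the paper), which is the model-agnostic property the abstract highlights.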

IJCAI Conference 2020 Conference Paper

Inverse Reinforcement Learning for Team Sports: Valuing Actions and Players

  • Yudong Luo
  • Oliver Schulte
  • Pascal Poupart

A major task of sports analytics is to rank players based on the impact of their actions. Recent methods have applied reinforcement learning (RL) to assess the value of actions from a learned action value or Q-function. A fundamental challenge for estimating action values is that explicit reward signals (goals) are very sparse in many team sports, such as ice hockey and soccer. This paper combines Q-function learning with inverse reinforcement learning (IRL) to provide a novel player ranking method. We treat professional play as expert demonstrations for learning an implicit reward function. Our method alternates single-agent IRL to learn a reward function for multiple agents; we provide a theoretical justification for this procedure. Knowledge transfer is used to combine learned rewards and observed rewards from goals. Empirical evaluation, based on 4.5M play-by-play events in the National Hockey League (NHL), indicates that player ranking using the learned rewards achieves high correlations with standard success measures and temporal consistency throughout a season.

NeurIPS Conference 2020 Conference Paper

Learning Agent Representations for Ice Hockey

  • Guiliang Liu
  • Oliver Schulte
  • Pascal Poupart
  • Mike Rudd
  • Mehrsan Javan

Team sports are a new application domain for agent modeling with high real-world impact. A fundamental challenge for modeling professional players is their large number (over 1K), which includes many bench players with sparse participation in a game season. The diversity and sparsity of player observations make it difficult to extend previous agent representation models to the sports domain. This paper develops a new approach for agent representations, based on a Markov game model, that is tailored towards applications in professional ice hockey. We introduce a novel framework that learns player representations via player generation, where a variational encoder embeds player information with latent variables. The encoder learns a context-specific shared prior to induce a shrinkage effect for the posterior player representations, allowing it to share statistical information across players with different levels of participation. To model the play dynamics in sequential sports data, we design a Variational Recurrent Ladder Agent Encoder (VaRLAE). It learns a contextualized player representation with a hierarchy of latent variables that effectively prevents latent posterior collapse. We validate our player representations in major sports analytics tasks. Our experimental results, based on a large dataset that contains over 4.5M events, show state-of-the-art performance for our VaRLAE on facilitating 1) identifying the acting player, 2) estimating expected goals, and 3) predicting the final score difference.

NeurIPS Conference 2020 Conference Paper

Learning Dynamic Belief Graphs to Generalize on Text-Based Games

  • Ashutosh Adhikari
  • Xingdi Yuan
  • Marc-Alexandre Côté
  • Mikuláš Zelinka
  • Marc-Antoine Rondeau
  • Romain Laroche
  • Pascal Poupart
  • Jian Tang

Playing text-based games requires skills in processing natural language and sequential decision making. Achieving human-level performance on text-based games remains an open challenge, and prior research has largely relied on hand-crafted structured representations and heuristics. In this work, we investigate how an agent can plan and generalize in text-based games using graph-structured representations learned end-to-end from raw text. We propose a novel graph-aided transformer agent (GATA) that infers and updates latent belief graphs during planning to enable effective action selection by capturing the underlying game dynamics. GATA is trained using a combination of reinforcement and self-supervised learning. Our work demonstrates that the learned graph-based representations help agents converge to better policies than their text-only counterparts and facilitate effective generalization across game configurations. Experiments on 500+ unique games from the TextWorld suite show that our best agent outperforms text-based baselines by an average of 24.2%.

ICML Conference 2020 Conference Paper

Online Bayesian Moment Matching based SAT Solver Heuristics

  • Haonan Duan 0002
  • Saeed Nejati
  • George Trimponias
  • Pascal Poupart
  • Vijay Ganesh 0001

In this paper, we present a Bayesian Moment Matching (BMM) based method aimed at solving the initialization problem in Boolean SAT solvers. The initialization problem can be stated as follows: given a SAT formula $\phi$, compute an initial order over the variables of $\phi$ and values/polarity for these variables such that the runtime of SAT solvers on input $\phi$ is minimized. At the start of a solver run, our BMM-based methods compute a posterior probability distribution over an assignment to the variables of the input formula after analyzing its clauses, which is then used by the solver to initialize its search. We perform extensive experiments to evaluate the efficacy of our BMM-based heuristic against 4 other initialization methods (random, survey propagation, Jeroslow-Wang, and default) in the state-of-the-art solvers MapleCOMSPS and MapleLCMDistChronoBT over the SAT competition 2018 application benchmark, as well as the best-known solvers in the cryptographic category, namely, CryptoMiniSat, Glucose, and MapleSAT. On the cryptographic benchmark, BMM-based solvers outperform all other initialization methods. Further, the BMM-based MapleCOMSPS significantly outperforms the same solver with all other initialization methods, solving 12 additional instances with a better average runtime over the SAT 2018 competition benchmark.
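Of the initialization baselines listed, Jeroslow-Wang is simple enough to sketch: each literal is scored by summing 2^-|c| over the clauses c containing it, variables are ordered by total score, and the initial polarity follows the higher-scoring literal. A minimal sketch (the DIMACS-style signed-integer literal encoding is an assumption of this illustration; the paper's BMM method itself is not shown):

```python
from collections import defaultdict

def jeroslow_wang_init(clauses):
    """Jeroslow-Wang initialization: score literal l by sum(2^-|c|) over
    clauses c containing l; order variables by combined score of both
    polarities; set polarity toward the higher-scoring literal."""
    score = defaultdict(float)
    for c in clauses:
        for lit in c:
            score[lit] += 2.0 ** -len(c)   # short clauses weigh more
    variables = {abs(l) for c in clauses for l in c}
    order = sorted(variables, key=lambda v: -(score[v] + score[-v]))
    polarity = {v: score[v] >= score[-v] for v in variables}
    return order, polarity

# (x1 v x2) & (x1 v -x3) & (-x2 v x3 v x1)
clauses = [[1, 2], [1, -3], [-2, 3, 1]]
order, polarity = jeroslow_wang_init(clauses)
```

Here x1 appears in all three clauses (two of them short), so it is branched first with positive polarity, while x3's negative literal outscores its positive one.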

ICLR Conference 2020 Conference Paper

Progressive Memory Banks for Incremental Domain Adaptation

  • Nabiha Asghar
  • Lili Mou
  • Kira A. Selby
  • Kevin D. Pantasdo
  • Pascal Poupart
  • Xin Jiang 0002

This paper addresses the problem of incremental domain adaptation (IDA) in natural language processing (NLP). We assume the domains come one after another, and that we can only access data in the current domain. The goal of IDA is to build a unified model performing well on all the domains that we have encountered. We adopt the recurrent neural network (RNN) widely used in NLP, but augment it with a directly parameterized memory bank, which is retrieved by an attention mechanism at each step of the RNN transition. The memory bank provides a natural way of performing IDA: when adapting our model to a new domain, we progressively add new slots to the memory bank, which increases the number of parameters, and thus the model capacity. We learn the new memory slots and fine-tune existing parameters by back-propagation. Experimental results show that our approach achieves significantly better performance than fine-tuning alone. Compared with expanding hidden states, our approach is more robust for old domains, as shown by both empirical and theoretical results. Our model also outperforms previous work on IDA, including elastic weight consolidation and progressive neural networks, in the experiments.
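The memory-bank mechanism described above, attention retrieval plus progressive slot growth, can be sketched with plain dot-product attention (the dimensions and the unnormalized score function are illustrative assumptions of this sketch):

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D score vector
    e = np.exp(x - x.max())
    return e / e.sum()

def retrieve(memory, h):
    """Attention read over a memory bank: score each slot against the
    RNN hidden state, then return the convex combination of slots."""
    weights = softmax(memory @ h)        # (num_slots,)
    return weights @ memory              # (dim,) retrieved vector

rng = np.random.default_rng(0)
memory = rng.normal(size=(8, 16))        # 8 directly parameterized slots
h = rng.normal(size=16)                  # current RNN hidden state
read = retrieve(memory, h)

# progressive expansion for a new domain: append fresh slots, leaving
# the existing (already trained) slots intact
memory = np.vstack([memory, rng.normal(size=(4, 16))])
```

Because adaptation only appends rows, the parameters serving old domains are never overwritten, which is the intuition behind the robustness claim for old domains.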

JMLR Journal 2020 Journal Article

Representation Learning for Dynamic Graphs: A Survey

  • Seyed Mehran Kazemi
  • Rishab Goel
  • Kshitij Jain
  • Ivan Kobyzev
  • Akshay Sethi
  • Peter Forsyth
  • Pascal Poupart

Graphs arise naturally in many real-world applications including social networks, recommender systems, ontologies, biology, and computational finance. Traditionally, machine learning models for graphs have been mostly designed for static graphs. However, many applications involve evolving graphs. This introduces important challenges for learning and inference since nodes, attributes, and edges change over time. In this survey, we review the recent advances in representation learning for dynamic graphs, including dynamic knowledge graphs. We describe existing models from an encoder-decoder perspective, categorize these encoders and decoders based on the techniques they employ, and analyze the approaches in each category. We also review several prominent applications and widely used datasets and highlight directions for future research.

IJCAI Conference 2020 Conference Paper

Unsupervised Multilingual Alignment using Wasserstein Barycenter

  • Xin Lian
  • Kshitij Jain
  • Jakub Truszkowski
  • Pascal Poupart
  • Yaoliang Yu

We study unsupervised multilingual alignment, the problem of finding word-to-word translations between multiple languages without using any parallel data. One popular strategy is to reduce multilingual alignment to the much simplified bilingual setting, by picking one of the input languages as the pivot language that we transit through. However, it is well-known that transiting through a poorly chosen pivot language (such as English) may severely degrade the translation quality, since the assumed transitive relations among all pairs of languages may not be enforced in the training process. Instead of going through a rather arbitrarily chosen pivot language, we propose to use the Wasserstein barycenter as a more informative "mean" language: it encapsulates information from all languages and minimizes all pairwise transportation costs. We evaluate our method on standard benchmarks and demonstrate state-of-the-art performances.

UAI Conference 2019 Conference Paper

Comparing EM with GD in Mixture Models of Two Components

  • Guojun Zhang
  • Pascal Poupart
  • George Trimponias

The expectation-maximization (EM) algorithm has been widely used in minimizing the negative log likelihood (also known as cross entropy) of mixture models. However, little is understood about the goodness of the fixed points it converges to. In this paper, we study the regions where one component is missing in two-component mixture models, which we call one-cluster regions. We analyze the propensity of such regions to trap EM and gradient descent (GD) for mixtures of two Gaussians and mixtures of two Bernoullis. In the case of Gaussian mixtures, EM escapes one-cluster regions exponentially fast, while GD escapes them linearly fast. In the case of mixtures of Bernoullis, we find that there exist one-cluster regions that are stable for GD and therefore trap GD, but those regions are unstable for EM, allowing EM to escape. Those regions are local minima that appear universally in experiments and can be arbitrarily bad. This work implies that EM is less likely than GD to converge to certain bad local optima in mixture models.
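For concreteness, the EM updates on a two-component Gaussian mixture take the closed form below. This sketch fixes unit variances and fits only the mixing weight and the two means, a simplification of the setting analyzed in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
# well-separated two-component 1-D Gaussian mixture
x = np.concatenate([rng.normal(-3, 1, 500), rng.normal(3, 1, 500)])

pi, mu = 0.5, np.array([-1.0, 1.0])   # initialization; variances fixed at 1
for _ in range(50):
    # E-step: responsibility of component 0 for each point
    p0 = pi * np.exp(-0.5 * (x - mu[0]) ** 2)
    p1 = (1 - pi) * np.exp(-0.5 * (x - mu[1]) ** 2)
    r = p0 / (p0 + p1)
    # M-step: closed-form updates of the weight and the means
    pi = r.mean()
    mu = np.array([(r * x).sum() / r.sum(),
                   ((1 - r) * x).sum() / (1 - r).sum()])
```

Started inside the basin of the good optimum, both means move to the true cluster centers; the paper's one-cluster regions correspond to initializations where one component's responsibilities vanish everywhere.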

RLDM Conference 2019 Conference Abstract

Multi Type Mean Field Reinforcement Learning

  • Sriram Ganapathi Subramanian
  • Pascal Poupart
  • Matthew Taylor
  • Nidhi Hegde

Mean field theory has been integrated with the field of multiagent reinforcement learning to enable multiagent algorithms to scale to a large number of interacting agents in the environment. In this paper, we extend mean field multiagent algorithms to multiple types. The types enable the relaxation of a core assumption in mean field games, namely that all agents in the environment play nearly identical strategies and have the same goal. We consider two new testbeds for the field of many-agent reinforcement learning, based on the standard MAgent testbed for many-agent environments. Here we consider two different kinds of mean field games. In the first kind of games, agents belong to predefined types that are known a priori. In the second kind of games, the type of each agent is unknown and therefore must be learned based on observations. We introduce new algorithms for each of the scenarios and demonstrate superior performance to state-of-the-art algorithms that assume all agents belong to the same type, as well as other baseline algorithms in the MAgent framework.

UAI Conference 2019 Conference Paper

On the Relationship Between Satisfiability and Markov Decision Processes

  • Ricardo Salmon
  • Pascal Poupart

Stochastic satisfiability (SSAT) and decision-theoretic planning in finite horizon partially observable Markov decision processes (POMDPs) are both PSPACE-Complete. We describe constructive reductions between SSAT and flat POMDPs that open the door to comparisons and future cross-fertilization between the solution techniques of those problems. We also propose a new SSAT solver called Prime that incorporates recent advances from the SAT and #SAT literature. Using our reduction from POMDP to SSAT, we demonstrate the competitiveness of Prime on finite horizon POMDP problems.

IJCAI Conference 2018 Conference Paper

An Empirical Study of Branching Heuristics through the Lens of Global Learning Rate

  • Jia Liang
  • Hari Govind
  • Pascal Poupart
  • Krzysztof Czarnecki
  • Vijay Ganesh

In this paper, we analyze a suite of 7 well-known branching heuristics proposed by the SAT community and show that the better heuristics tend to generate more learnt clauses per decision, a metric we define as the global learning rate (GLR). We propose GLR as a metric for the branching heuristic to optimize. We test our hypothesis by developing a new branching heuristic that maximizes GLR greedily. We show empirically that this heuristic achieves very high GLR and, interestingly, very low literal block distance (LBD) over the learnt clauses. In our experiments this greedy branching heuristic enables the solver to solve instances faster than VSIDS, when the branching time is taken out of the equation. This experiment is a good proof of concept that a branching heuristic maximizing GLR will lead to good solver performance modulo the computational overhead. Finally, we propose a new branching heuristic, called SGDB, that uses machine learning to cheaply approximate greedy maximization of GLR. We show experimentally that SGDB performs on par with the VSIDS branching heuristic.
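The GLR metric itself is easy to state: learnt clauses divided by branching decisions over a solver run. A toy sketch over a hypothetical solver event trace (the trace format is an assumption of this illustration, not an interface of any real solver):

```python
def global_learning_rate(trace):
    """GLR as defined in the paper: total learnt clauses divided by the
    number of branching decisions. `trace` is a hypothetical list of
    solver events, each either 'decision' or 'learn'."""
    decisions = sum(1 for e in trace if e == "decision")
    learnt = sum(1 for e in trace if e == "learn")
    return learnt / decisions if decisions else 0.0

# a made-up run: 3 decisions producing 3 learnt clauses overall
trace = ["decision", "decision", "learn", "decision", "learn", "learn"]
glr = global_learning_rate(trace)
```

The paper's greedy heuristic picks, at each decision point, the variable expected to maximize this ratio going forward.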

NeurIPS Conference 2018 Conference Paper

Deep Homogeneous Mixture Models: Representation, Separation, and Approximation

  • Priyank Jaini
  • Pascal Poupart
  • Yaoliang Yu

At their core, many unsupervised learning models provide a compact representation of homogeneous density mixtures, but their similarities and differences are not always clearly understood. In this work, we formally establish the relationships among latent tree graphical models (including special cases such as hidden Markov models and tensorial mixture models), hierarchical tensor formats and sum-product networks. Based on this connection, we then give a unified treatment of exponential separation in *exact* representation size between deep mixture architectures and shallow ones. In contrast, for *approximate* representation, we show that the conditional gradient algorithm can approximate any homogeneous mixture within $\epsilon$ accuracy by combining $O(1/\epsilon^2)$ "shallow" architectures, where the hidden constant may decrease (exponentially) with respect to the depth. Our experiments on both synthetic and real datasets confirm the benefits of depth in density estimation.

AAMAS Conference 2018 Conference Paper

Faster Policy Adaptation in Environments with Exogeneity: A State Augmentation Approach

  • Zhuoshu Li
  • Zhitang Chen
  • Pascal Poupart
  • Sanmay Das
  • Yanhui Geng

The reinforcement learning literature typically assumes fixed state transition functions for the sake of tractability. However, in many real-world tasks, the state transition function changes over time, and this change may be governed by exogenous variables outside of the control loop. This can make policy learning difficult. In this paper, we propose a new algorithm to address the aforementioned challenge by embedding the state transition functions at different timestamps into a Reproducing Kernel Hilbert Space; the exogenous variable, as the cause of the state transition evolution, is estimated by projecting the embeddings into the subspace that preserves maximum variance. By augmenting the observable state vector with the estimated exogenous variable, standard RL algorithms such as Q-learning are able to learn faster and better. Experiments with both synthetic and real data demonstrate the superiority of our proposed algorithm over standard and advanced variants of Q-learning algorithms in dynamic environments.

NeurIPS Conference 2018 Conference Paper

Monte-Carlo Tree Search for Constrained POMDPs

  • JongMin Lee
  • Geon-Hyeong Kim
  • Pascal Poupart
  • Kee-Eung Kim

Monte-Carlo Tree Search (MCTS) has been successfully applied to very large POMDPs, a standard model for stochastic sequential decision-making problems. However, many real-world problems inherently have multiple goals, where multi-objective formulations are more natural. The constrained POMDP (CPOMDP) is such a model that maximizes the reward while constraining the cost, extending the standard POMDP model. To date, solution methods for CPOMDPs assume an explicit model of the environment, and thus are hardly applicable to large-scale real-world problems. In this paper, we present CC-POMCP (Cost-Constrained POMCP), an online MCTS algorithm for large CPOMDPs that leverages the optimization of LP-induced parameters and only requires a black-box simulator of the environment. In the experiments, we demonstrate that CC-POMCP converges to the optimal stochastic action selection in CPOMDP and pushes the state-of-the-art by being able to scale to very large problems.

NeurIPS Conference 2018 Conference Paper

Online Structure Learning for Feed-Forward and Recurrent Sum-Product Networks

  • Agastya Kalra
  • Abdullah Rashwan
  • Wei-Shou Hsu
  • Pascal Poupart
  • Prashant Doshi
  • Georgios Trimponias

Sum-product networks have recently emerged as an attractive representation due to their dual view as a special type of deep neural network with clear semantics and a special type of probabilistic graphical model for which inference is always tractable. Those properties follow from some conditions (i.e., completeness and decomposability) that must be respected by the structure of the network. As a result, it is not easy to specify a valid sum-product network by hand and therefore structure learning techniques are typically used in practice. This paper describes a new online structure learning technique for feed-forward and recurrent SPNs. The algorithm is demonstrated on real-world datasets with continuous features for which it is not clear what network architecture might be best, including sequence datasets of varying length.

AAAI Conference 2018 Conference Paper

Order-Planning Neural Text Generation From Structured Data

  • Lei Sha
  • Lili Mou
  • Tianyu Liu
  • Pascal Poupart
  • Sujian Li
  • Baobao Chang
  • Zhifang Sui

Generating texts from structured data (e.g., a table) is important for various natural language processing tasks such as question answering and dialog systems. In recent studies, researchers use neural language models and encoder-decoder frameworks for table-to-text generation. However, these neural network-based approaches typically do not model the order of content during text generation. When a human writes a summary based on a given table, he or she would probably consider the content order before wording. In this paper, we propose an order-planning text generation model, where order information is explicitly captured by link-based attention. Then a self-adaptive gate combines the link-based attention with traditional content-based attention. We conducted experiments on the WIKIBIO dataset and achieve higher performance than previous methods in terms of BLEU, ROUGE, and NIST scores; we also performed ablation tests to analyze each component of our model.

NeurIPS Conference 2018 Conference Paper

Unsupervised Video Object Segmentation for Deep Reinforcement Learning

  • Vikash Goel
  • Jameson Weng
  • Pascal Poupart

We present a new technique for deep reinforcement learning that automatically detects moving objects and uses the relevant information for action selection. The detection of moving objects is done in an unsupervised way by exploiting structure from motion. Instead of directly learning a policy from raw images, the agent first learns to detect and segment moving objects by exploiting flow information in video sequences. The learned representation is then used to focus the policy of the agent on the moving objects. Over time, the agent identifies which objects are critical for decision making and gradually builds a policy based on relevant moving objects. This approach, which we call Motion-Oriented REinforcement Learning (MOREL), is demonstrated on a suite of Atari games where the ability to detect moving objects reduces the amount of interaction needed with the environment to obtain a good policy. Furthermore, the resulting policy is more interpretable than policies that directly map images to actions or values with a black box neural network. We can gain insight into the policy by inspecting the segmentation and motion of each object detected by the agent. This allows practitioners to confirm whether a policy is making decisions based on sensible information. Our code is available at https://github.com/vik-goel/MOREL.

SAT Conference 2017 Conference Paper

A Propagation Rate Based Splitting Heuristic for Divide-and-Conquer Solvers

  • Saeed Nejati
  • Zack Newsham
  • Joseph Scott
  • Jia Hui Liang
  • Catherine H. Gebotys
  • Pascal Poupart
  • Vijay Ganesh 0001

In this paper, we present a divide-and-conquer SAT solver, MapleAmpharos, that uses a novel propagation-rate (PR) based splitting heuristic. The key idea is that we rank variables based on the ratio of how many propagations they cause during the run of the worker conflict-driven clause-learning solvers to the number of times they are branched on, with the variable that causes the most propagations ranked first. The intuition here is that, in the context of divide-and-conquer solvers, it is most profitable to split on variables that maximize the propagation rate. Our implementation MapleAmpharos uses the AMPHAROS solver as its base. We performed an extensive evaluation of MapleAmpharos against other competitive parallel solvers such as Treengeling, Plingeling, Parallel CryptoMiniSat5, and Glucose-Syrup. We show that on the SAT 2016 competition Application benchmark and a set of cryptographic instances, our solver MapleAmpharos is competitive with respect to these top parallel solvers. What is surprising is that we obtain this result primarily by modifying the splitting heuristic.

SAT Conference 2017 Conference Paper

An Empirical Study of Branching Heuristics Through the Lens of Global Learning Rate

  • Jia Hui Liang
  • Hari Govind V. K.
  • Pascal Poupart
  • Krzysztof Czarnecki 0001
  • Vijay Ganesh 0001

In this paper, we analyze a suite of 7 well-known branching heuristics proposed by the SAT community and show that the better heuristics tend to generate more learnt clauses per decision, a metric we define as the global learning rate (GLR). Like our previous work on the LRB branching heuristic, we once again view these heuristics as techniques to solve the learning rate optimization problem. First, we show that there is a strong positive correlation between GLR and solver efficiency for a variety of branching heuristics. Second, we test our hypothesis further by developing a new branching heuristic that maximizes GLR greedily. We show empirically that this heuristic achieves very high GLR and interestingly very low literal block distance (LBD) over the learnt clauses. In our experiments this greedy branching heuristic enables the solver to solve instances faster than VSIDS, when the branching time is taken out of the equation. This experiment is a good proof of concept that a branching heuristic maximizing GLR will lead to good solver performance modulo the computational overhead. Third, we propose that machine learning algorithms are a good way to cheaply approximate the greedy GLR maximization heuristic as already witnessed by LRB. In addition, we design a new branching heuristic, called SGDB, that uses a stochastic gradient descent online learning method to dynamically order branching variables in order to maximize GLR. We show experimentally that SGDB performs on par with the VSIDS branching heuristic.

IJCAI Conference 2017 Conference Paper

Constrained Bayesian Reinforcement Learning via Approximate Linear Programming

  • JongMin Lee
  • Youngsoo Jang
  • Pascal Poupart
  • Kee-Eung Kim

In this paper, we consider the safe learning scenario where we need to restrict the exploratory behavior of a reinforcement learning agent. Specifically, we treat the problem as a form of Bayesian reinforcement learning in an environment that is modeled as a constrained MDP (CMDP) where the cost function penalizes undesirable situations. We propose a model-based Bayesian reinforcement learning (BRL) algorithm for such an environment, eliciting risk-sensitive exploration in a principled way. Our algorithm efficiently solves the constrained BRL problem by approximate linear programming, and generates a finite state controller in an off-line manner. We provide theoretical guarantees and demonstrate empirically that our approach outperforms the state of the art.

AAAI Conference 2017 Short Paper

Discovering Conversational Dependencies between Messages in Dialogs

  • Wenchao Du
  • Pascal Poupart
  • Wei Xu

We investigate the task of inferring conversational dependencies between messages in one-on-one online chat, which has become one of the most popular forms of customer service. We propose a novel probabilistic classifier that leverages conversational, lexical and semantic information. The approach is evaluated empirically on a set of customer service chat logs from a Chinese e-commerce website. It outperforms heuristic baselines.

NeurIPS Conference 2016 Conference Paper

A Unified Approach for Learning the Parameters of Sum-Product Networks

  • Han Zhao
  • Pascal Poupart
  • Geoffrey Gordon

We present a unified approach for learning the parameters of Sum-Product networks (SPNs). We prove that any complete and decomposable SPN is equivalent to a mixture of trees where each tree corresponds to a product of univariate distributions. Based on the mixture model perspective, we characterize the objective function when learning SPNs based on the maximum likelihood estimation (MLE) principle and show that the optimization problem can be formulated as a signomial program. We construct two parameter learning algorithms for SPNs by using sequential monomial approximations (SMA) and the concave-convex procedure (CCCP), respectively. The two proposed methods naturally admit multiplicative updates, hence effectively avoiding the projection operation. With the help of the unified framework, we also show that, in the case of SPNs, CCCP leads to the same algorithm as Expectation Maximization (EM) despite the fact that they are different in general.

AAAI Conference 2016 Conference Paper

Decision Sum-Product-Max Networks

  • Mazen Melibari
  • Pascal Poupart
  • Prashant Doshi

Sum-Product Networks (SPNs) were recently proposed as a new class of probabilistic graphical models that guarantee tractable inference, even on models with high-treewidth. In this paper, we propose a new extension to SPNs, called Decision Sum-Product-Max Networks (Decision-SPMNs), that makes SPNs suitable for discrete multi-stage decision problems. We present an algorithm that solves Decision-SPMNs in a time that is linear in the size of the network. We also present algorithms to learn the parameters of the network from data.

AAAI Conference 2016 Conference Paper

Exponential Recency Weighted Average Branching Heuristic for SAT Solvers

  • Jia Liang
  • Vijay Ganesh
  • Pascal Poupart
  • Krzysztof Czarnecki

Modern conflict-driven clause-learning SAT solvers routinely solve large real-world instances with millions of clauses and variables in them. Their success crucially depends on effective branching heuristics. In this paper, we propose a new branching heuristic inspired by the exponential recency weighted average algorithm used to solve the bandit problem. The branching heuristic, we call CHB, learns online which variables to branch on by leveraging the feedback received from conflict analysis. We evaluated CHB on 1200 instances from the 2013 and 2014 SAT Competitions, and showed that CHB solves significantly more instances than VSIDS, currently the most effective branching heuristic in widespread use. More precisely, we implemented CHB as part of the MiniSat and Glucose solvers, and performed an apples-to-apples comparison with their VSIDS-based variants. CHB-based MiniSat (resp. CHB-based Glucose) solved approximately 16.1% (resp. 5.6%) more instances than their VSIDS-based variants. Additionally, CHB-based solvers are much more efficient at constructing first preimage attacks on step-reduced SHA-1 and MD5 cryptographic hash functions, than their VSIDS-based counterparts. To the best of our knowledge, CHB is the first branching heuristic to solve significantly more instances than VSIDS on a large, diverse benchmark of real-world instances.
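The exponential recency weighted average update that CHB builds on can be sketched as follows; this is an illustrative sketch under our own assumptions (names, step size, and the recency-based reward shape are hypothetical, not the authors' implementation):

```python
def erwa_update(Q, var, reward, alpha):
    """Move the branching score Q[var] toward reward by step size alpha,
    the exponential recency weighted average update from bandit algorithms."""
    Q[var] = (1.0 - alpha) * Q[var] + alpha * reward
    return Q[var]

def recency_reward(num_conflicts, last_conflict_of_var):
    """Recency-style reward: variables involved in recent conflicts score
    higher, so the solver branches on them again sooner."""
    return 1.0 / (num_conflicts - last_conflict_of_var + 1)
```

At each branching decision the solver would then pick the unassigned variable with the highest score, updating scores from conflict-analysis feedback as learning proceeds.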

SAT Conference 2016 Conference Paper

Learning Rate Based Branching Heuristic for SAT Solvers

  • Jia Hui Liang
  • Vijay Ganesh 0001
  • Pascal Poupart
  • Krzysztof Czarnecki 0001

In this paper, we propose a framework for viewing solver branching heuristics as optimization algorithms where the objective is to maximize the learning rate, defined as the propensity for variables to generate learnt clauses. By viewing online variable selection in SAT solvers as an optimization problem, we can leverage a wide variety of optimization algorithms, especially from machine learning, to design effective branching heuristics. In particular, we model the variable selection optimization problem as an online multi-armed bandit, a special case of reinforcement learning, to learn branching variables such that the learning rate of the solver is maximized. We develop a branching heuristic that we call learning rate branching or LRB, based on a well-known multi-armed bandit algorithm called exponential recency weighted average and implement it as part of MiniSat and CryptoMiniSat. We upgrade the LRB technique with two additional novel ideas to improve the learning rate by accounting for reason side rate and exploiting locality. The resulting LRB branching heuristic is shown to be faster than the VSIDS and conflict history-based (CHB) branching heuristics on 1975 application and hard combinatorial instances from 2009 to 2014 SAT Competitions. We also show that CryptoMiniSat with LRB solves more instances than the one with VSIDS. These experiments show that LRB improves on the state of the art.

NeurIPS Conference 2016 Conference Paper

Online Bayesian Moment Matching for Topic Modeling with Unknown Number of Topics

  • Wei-Shou Hsu
  • Pascal Poupart

Latent Dirichlet Allocation (LDA) is a very popular model for topic modeling as well as many other problems with latent groups. It is both simple and effective. When the number of topics (or latent groups) is unknown, the Hierarchical Dirichlet Process (HDP) provides an elegant non-parametric extension; however, it is a complex model and it is difficult to incorporate prior knowledge since the distribution over topics is implicit. We propose two new models that extend LDA in a simple and intuitive fashion by directly expressing a distribution over the number of topics. We also propose a new online Bayesian moment matching technique to learn the parameters and the number of topics of those models based on streaming data. The approach achieves higher log-likelihood than batch and online HDP with fixed hyperparameters on several corpora.

IJCAI Conference 2016 Conference Paper

Sum-Product-Max Networks for Tractable Decision Making

  • Mazen Melibari
  • Pascal Poupart
  • Prashant Doshi

Investigations into probabilistic graphical models for decision making have predominantly centered on influence diagrams (IDs) and decision circuits (DCs) for representation and computation of decision rules that maximize expected utility. Since IDs are typically handcrafted and DCs are compiled from IDs, in this paper we propose an approach to learn the structure and parameters of decision-making problems directly from data. We present a new representation called sum-product-max network (SPMN) that generalizes a sum-product network (SPN) to the class of decision-making problems and whose solution, analogous to DCs, scales linearly in the size of the network. We show that SPMNs may be reduced to DCs linearly and present a first method for learning SPMNs from data. This approach is significant because it facilitates a novel paradigm of tractable decision making driven by data.

AAAI Conference 2015 Conference Paper

Approximate Linear Programming for Constrained Partially Observable Markov Decision Processes

  • Pascal Poupart
  • Aarti Malhotra
  • Pei Pei
  • Kee-Eung Kim
  • Bongseok Goh
  • Michael Bowling

In many situations, it is desirable to optimize a sequence of decisions by maximizing a primary objective while respecting some constraints with respect to secondary objectives. Such problems can be naturally modeled as constrained partially observable Markov decision processes (CPOMDPs) when the environment is partially observable. In this work, we describe a technique based on approximate linear programming to optimize policies in CPOMDPs. The optimization is performed offline and produces a finite state controller with desirable performance guarantees. The approach outperforms a constrained version of point-based value iteration on a suite of benchmark problems.

ICML Conference 2015 Conference Paper

On the Relationship between Sum-Product Networks and Bayesian Networks

  • Han Zhao 0002
  • Mazen Melibari
  • Pascal Poupart

In this paper, we establish some theoretical connections between Sum-Product Networks (SPNs) and Bayesian Networks (BNs). We prove that every SPN can be converted into a BN in linear time and space in terms of the network size. The key insight is to use Algebraic Decision Diagrams (ADDs) to compactly represent the local conditional probability distributions at each node in the resulting BN by exploiting context-specific independence (CSI). The generated BN has a simple directed bipartite graphical structure. We show that by applying the Variable Elimination algorithm (VE) to the generated BN with ADD representations, we can recover the original SPN where the SPN can be viewed as a history record or caching of the VE inference process. To help state the proof clearly, we introduce the notion of *normal* SPN and present a theoretical analysis of the consistency and decomposability properties. We conclude the paper with some discussion of the implications of the proof and establish a connection between the depth of an SPN and a lower bound of the tree-width of its corresponding BN.

IJCAI Conference 2015 Conference Paper

Self-Adaptive Hierarchical Sentence Model

  • Han Zhao
  • Zhengdong Lu
  • Pascal Poupart

The ability to accurately model a sentence at varying stages (e.g., word-phrase-sentence) plays a central role in natural language processing. As an effort towards this goal, we propose a self-adaptive hierarchical sentence model (AdaSent). AdaSent effectively forms a hierarchy of representations from words to phrases and then to sentences through recursive gated local composition of adjacent segments. We design a competitive mechanism (through gating networks) to allow the representations of the same sentence to be engaged in a particular learning task (e.g., classification), therefore effectively mitigating the gradient vanishing problem persistent in other recursive models. Both qualitative and quantitative analysis shows that AdaSent can automatically form and select the representations suitable for the task at hand during training, yielding superior classification performance over competitor models on 5 benchmark data sets.

AAAI Conference 2015 Conference Paper

SoF: Soft-Cluster Matrix Factorization for Probabilistic Clustering

  • Han Zhao
  • Pascal Poupart
  • Yongfeng Zhang
  • Martin Lysy

We propose SoF (Soft-cluster matrix Factorization), a probabilistic clustering algorithm which softly assigns each data point into clusters. Unlike model-based clustering algorithms, SoF does not make assumptions about the data density distribution. Instead, we take an axiomatic approach to define 4 properties that the probability of co-clustered pairs of points should satisfy. Based on the properties, SoF utilizes a distance measure between pairs of points to induce the conditional co-cluster probabilities. The objective function in our framework establishes an important connection between probabilistic clustering and constrained symmetric Nonnegative Matrix Factorization (NMF), hence providing a theoretical interpretation for NMF-based clustering algorithms. To optimize the objective, we derive a sequential minimization algorithm using a penalty method. Experimental results on both synthetic and real-world datasets show that SoF significantly outperforms previous NMF-based algorithms and that it is able to detect non-convex patterns as well as cluster boundaries.

IJCAI Conference 2013 Conference Paper

Isomorph-Free Branch and Bound Search for Finite State Controllers

  • Marek Grześ
  • Pascal Poupart
  • Jesse Hoey

The recent proliferation of smart-phones and other wearable devices has led to a surge of new mobile applications. Partially observable Markov decision processes provide a natural framework to design applications that continuously make decisions based on noisy sensor measurements. However, given the limited battery life, there is a need to minimize the amount of online computation. This can be achieved by compiling a policy into a finite state controller since there is no need for belief monitoring or online search. In this paper, we propose a new branch and bound technique to search for a good controller. In contrast to many existing algorithms for controllers, our search technique is not subject to local optima. We also show how to reduce the amount of search by avoiding the enumeration of isomorphic controllers and by taking advantage of suitable upper and lower bounds. The approach is demonstrated on several benchmark problems as well as a smart-phone application to assist persons with Alzheimer's to wayfind.

IJCAI Conference 2013 Conference Paper

Learning Community-Based Preferences via Dirichlet Process Mixtures of Gaussian Processes

  • Ehsan Abbasnejad
  • Scott Sanner
  • Edwin V. Bonilla
  • Pascal Poupart

Bayesian approaches to preference learning using Gaussian Processes (GPs) are attractive due to their ability to explicitly model uncertainty in users' latent utility functions; unfortunately existing techniques have cubic time complexity in the number of users, which renders this approach intractable for collaborative preference learning over a large user base. Exploiting the observation that user populations often decompose into communities of shared preferences, we model user preferences as an infinite Dirichlet Process (DP) mixture of communities and learn (a) the expected number of preference communities represented in the data, (b) a GP-based preference model over items tailored to each community, and (c) the mixture weights representing each user's fraction of community membership. This results in a learning and inference process that scales linearly in the number of users rather than cubically and additionally provides the ability to analyze individual community preferences and their associated members. We evaluate our approach on a variety of preference data sources including Amazon Mechanical Turk showing that our method is more scalable and as accurate as previous GP-based preference learning work.

NeurIPS Conference 2012 Conference Paper

Cost-Sensitive Exploration in Bayesian Reinforcement Learning

  • Dongho Kim
  • Kee-Eung Kim
  • Pascal Poupart

In this paper, we consider Bayesian reinforcement learning (BRL) where actions incur costs in addition to rewards, and thus exploration has to be constrained in terms of the expected total cost while learning to maximize the expected long-term total reward. In order to formalize cost-sensitive exploration, we use the constrained Markov decision process (CMDP) as the model of the environment, in which we can naturally encode exploration requirements using the cost function. We extend BEETLE, a model-based BRL method, for learning in the environment with cost constraints. We demonstrate the cost-sensitive exploration behaviour in a number of simulated problems.

AAAI Conference 2012 Conference Paper

Hierarchical Double Dirichlet Process Mixture of Gaussian Processes

  • Aditya Tayal
  • Pascal Poupart
  • Yuying Li

We consider an infinite mixture model of Gaussian processes that share mixture components between nonlocal clusters in data. Meeds and Osindero (2006) use a single Dirichlet process prior to specify a mixture of Gaussian processes using an infinite number of experts. In this paper, we extend this approach to allow for experts to be shared non-locally across the input domain. This is accomplished with a hierarchical double Dirichlet process prior, which builds upon a standard hierarchical Dirichlet process by incorporating local parameters that are unique to each cluster while sharing mixture components between them. We evaluate the model on simulated and real data, showing that sharing Gaussian process components non-locally can yield effective and useful models for richly clustered non-stationary, non-linear data.

NeurIPS Conference 2012 Conference Paper

Symbolic Dynamic Programming for Continuous State and Observation POMDPs

  • Zahra Zamani
  • Scott Sanner
  • Pascal Poupart
  • Kristian Kersting

Partially-observable Markov decision processes (POMDPs) provide a powerful model for real-world sequential decision-making problems. In recent years, point-based value iteration methods have proven to be extremely effective techniques for finding (approximately) optimal dynamic programming solutions to POMDPs when an initial set of belief states is known. However, no point-based work has provided exact point-based backups for both continuous state and observation spaces, which we tackle in this paper. Our key insight is that while there may be an infinite number of possible observations, there are only a finite number of observation partitionings that are relevant for optimal decision-making when a finite, fixed set of reachable belief states is known. To this end, we make two important contributions: (1) we show how previous exact symbolic dynamic programming solutions for continuous state MDPs can be generalized to continuous state POMDPs with discrete observations, and (2) we show how this solution can be further extended via recently developed symbolic methods to continuous state and observations to derive the minimal relevant observation partitioning for potentially correlated, multivariate observation spaces. We demonstrate proof-of-concept results on uni- and multi-variate state and observation steam plant control.

NeurIPS Conference 2011 Conference Paper

Automated Refinement of Bayes Networks' Parameters based on Test Ordering Constraints

  • Omar Khan
  • Pascal Poupart
  • John-mark Agosta

In this paper, we derive a method to refine a Bayes network diagnostic model by exploiting constraints implied by expert decisions on test ordering. At each step, the expert executes an evidence gathering test, which suggests the test's relative diagnostic value. We demonstrate that consistency with an expert's test selection leads to non-convex constraints on the model parameters. We incorporate these constraints by augmenting the network with nodes that represent the constraint likelihoods. Gibbs sampling, stochastic hill climbing and greedy search algorithms are proposed to find a MAP estimate that takes into account test ordering constraints and any data available. We demonstrate our approach on diagnostic sessions from a manufacturing scenario.

ICAPS Conference 2011 Conference Paper

Closing the Gap: Improved Bounds on Optimal POMDP Solutions

  • Pascal Poupart
  • Kee-Eung Kim
  • Dongho Kim

POMDP algorithms have made significant progress in recent years by allowing practitioners to find good solutions to increasingly large problems. Most approaches (including point-based and policy iteration techniques) operate by refining a lower bound of the optimal value function. Several approaches (e.g., HSVI2, SARSOP, grid-based approaches and online forward search) also refine an upper bound. However, approximating the optimal value function by an upper bound is computationally expensive and therefore tightness is often sacrificed to improve efficiency (e.g., sawtooth approximation). In this paper, we describe a new approach to efficiently compute tighter bounds by i) conducting a prioritized breadth first search over the reachable beliefs, ii) propagating upper bound improvements with an augmented POMDP and iii) using exact linear programming (instead of the sawtooth approximation) for upper bound interpolation. As a result, we can represent the bounds more compactly and significantly reduce the gap between upper and lower bounds on several benchmark problems.

IJCAI Conference 2011 Conference Paper

Continuous Correlated Beta Processes

  • Robby Goetschalckx
  • Pascal Poupart
  • Jesse Hoey

In this paper we consider a (possibly continuous) space of Bernoulli experiments. We assume that the Bernoulli distributions of the points are correlated. All evidence data comes in the form of successful or failed experiments at different points. Current state-of-the-art methods for expressing a distribution over a continuum of Bernoulli distributions use logistic Gaussian processes or Gaussian copula processes. However, both of these require computationally expensive matrix operations (cubic in the general case). We introduce a more intuitive approach, directly correlating beta distributions by sharing evidence between them according to a kernel function, an approach which has linear time complexity. The approach can easily be extended to multiple outcomes, giving a continuous correlated Dirichlet process. This approach can be used for classification (both binary and multi-class) and learning the actual probabilities of the Bernoulli distributions. We show results for a number of data sets, as well as a case-study where a mixture of continuous beta processes is used as part of an automated stroke rehabilitation system.
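A hedged sketch of the evidence-sharing idea this abstract describes: each query point keeps Beta pseudo-counts, and every observed Bernoulli outcome contributes to those counts weighted by a kernel. The RBF kernel choice and all names here are illustrative assumptions, not the authors' code:

```python
import math

def rbf(x, y, lengthscale=1.0):
    """RBF kernel weighting how strongly evidence at y transfers to x."""
    return math.exp(-((x - y) ** 2) / (2 * lengthscale ** 2))

def beta_posterior(x, observations, alpha0=1.0, beta0=1.0, lengthscale=1.0):
    """Return (alpha, beta) pseudo-counts at x after kernel-weighted sharing.

    observations: list of (x_i, outcome) pairs with outcome in {0, 1}.
    Runs in O(n) per query point, in contrast to the cubic cost of the
    logistic Gaussian process approaches mentioned in the abstract.
    """
    a, b = alpha0, beta0
    for xi, outcome in observations:
        w = rbf(x, xi, lengthscale)
        if outcome:
            a += w   # successes shift mass toward alpha
        else:
            b += w   # failures shift mass toward beta
    return a, b
```

The posterior mean at `x` is then `a / (a + b)`, and the same scheme extends to multiple outcomes by keeping one kernel-weighted count per outcome, giving the continuous correlated Dirichlet process the abstract mentions.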

AAMAS Conference 2011 Conference Paper

Escaping Local Optima in POMDP Planning as Inference

  • Pascal Poupart
  • Tobias Lang
  • Marc Toussaint

Planning as inference recently emerged as a versatile approach to decision-theoretic planning and reinforcement learning for single and multi-agent systems in fully and partially observable domains with discrete and continuous variables. Since planning as inference essentially tackles a non-convex optimization problem when the states are partially observable, there is a need to develop techniques that can robustly escape local optima. We propose two algorithms: the first one adds nodes to the controller according to an increasingly deep forward search, while the second one splits nodes in a greedy fashion to improve reward likelihood.

IJCAI Conference 2011 Conference Paper

Point-Based Value Iteration for Constrained POMDPs

  • Dongho Kim
  • Jaesong Lee
  • Kee-Eung Kim
  • Pascal Poupart

Constrained partially observable Markov decision processes (CPOMDPs) extend the standard POMDPs by allowing the specification of constraints on some aspects of the policy in addition to the optimality objective for the value function. CPOMDPs have many practical advantages over standard POMDPs since they naturally model problems involving limited resource or multiple objectives. In this paper, we show that the optimal policies in CPOMDPs can be randomized, and present exact and approximate dynamic programming methods for computing randomized optimal policies. While the exact method requires solving a minimax quadratically constrained program (QCP) in each dynamic programming update, the approximate method utilizes the point-based value update with a linear program (LP). We show that the randomized policies are significantly better than the deterministic ones. We also demonstrate that the approximate point-based method is scalable to solve large problems.

AAMAS Conference 2011 Conference Paper

Smart Walkers! Enhancing the Mobility of the Elderly

  • Mathieu Sinn
  • Pascal Poupart

The idea of Smart Walkers is to equip customary rolling walkers with sensors in order to assist users, caregivers and clinicians. The integral part of the Smart Walkers is an autonomous agent which monitors the activity of the user, assesses his physical conditions, and detects potential risks of falls. In this paper, we study methods which enable the agent to recognize the user activity from the sensor measurements. The proposed methods use Conditional Random Fields with features based on discriminant rules. A special case are features which, in order to distinguish between two activities, compare the sensor measurements to thresholds learned by a linear classifier. Experiments with real user data show that the methods achieve a good accuracy; the best results are obtained using "smooth" thresholds based on sigmoid functions.

UAI Conference 2010 Conference Paper

Comparative Analysis of Probabilistic Models for Activity Recognition with an Instrumented Walker

  • Farheen Omar
  • Mathieu Sinn
  • Jakub Truszkowski
  • Pascal Poupart
  • James Yungjen Tung
  • Allen Caine

Rollating walkers are popular mobility aids used by older adults to improve balance control. There is a need to automatically recognize the activities performed by walker users to better understand activity patterns, mobility issues and the context in which falls are more likely to happen. We design and compare several techniques to recognize walker related activities. A comprehensive evaluation with control subjects and walker users from a retirement community is presented.

ICAPS Conference 2009 Conference Paper

Minimal Sufficient Explanations for Factored Markov Decision Processes

  • Omar Zia Khan
  • Pascal Poupart
  • James P. Black

Explaining policies of Markov Decision Processes (MDPs) is complicated due to their probabilistic and sequential nature. We present a technique to explain policies for factored MDPs by populating a set of domain-independent templates. We also present a mechanism to determine a minimal set of templates that, viewed together, completely justify the policy. Our explanations can be generated automatically at run-time with no additional effort required from the MDP designer. We demonstrate our technique using the problems of advising undergraduate students in their course selection and assisting people with dementia in completing the task of handwashing. We also evaluate our explanations for course-advising through a user study involving students.

ICAPS Conference 2008 Conference Paper

Efficient ADD Operations for Point-Based Algorithms

  • Guy Shani
  • Pascal Poupart
  • Ronen I. Brafman
  • Solomon Eyal Shimony

During the past few years, point-based POMDP solvers have gradually scaled up to handle medium-sized domains through better selection of the set of points and efficient backup methods. Point-based research has focused on flat, explicit representations of the state space, yet in many realistic domains a factored representation is more appropriate. Such domains have exponentially large state spaces, and current methods are unlikely to handle models of reasonable size. Thus, point-based methods need to be adapted to factored representations that model propositional state spaces better, e.g., using Algebraic Decision Diagrams (ADDs). While a straightforward ADD-based implementation can effectively tackle large factored POMDPs, we propose several techniques to further improve scalability. In particular, we show how ADDs can be used successfully in factored domains that exhibit reasonable locality. Our algorithms are several orders of magnitude faster than current point-based algorithms used with flat representations.

UAI Conference 2008 Conference Paper

Hierarchical POMDP Controller Optimization by Likelihood Maximization

  • Marc Toussaint
  • Laurent Charlin
  • Pascal Poupart

Planning can often be simplified by decomposing the task into smaller tasks arranged hierarchically. Charlin et al. [4] recently showed that the hierarchy discovery problem can be framed as a non-convex optimization problem. However, the inherent computational difficulty of solving such an optimization problem makes it hard to scale to real-world problems. In another line of research, Toussaint et al. [18] developed a method to solve planning problems by maximum-likelihood estimation. In this paper, we show how the hierarchy discovery problem in partially observable domains can be tackled using a similar maximum likelihood approach. Our technique first transforms the problem into a dynamic Bayesian network through which a hierarchical structure can naturally be discovered while optimizing the policy. Experimental results demonstrate that this approach scales better than previous techniques based on non-convex optimization.

ICML Conference 2006 Conference Paper

An analytic solution to discrete Bayesian reinforcement learning

  • Pascal Poupart
  • Nikos Vlassis
  • Jesse Hoey
  • Kevin Regan 0001

Reinforcement learning (RL) was originally proposed as a framework to allow agents to learn in an online fashion as they interact with their environment. Existing RL algorithms come short of achieving this goal because the amount of exploration required is often too costly and/or too time consuming for online learning. As a result, RL is mostly used for offline learning in simulated environments. We propose a new algorithm, called BEETLE, for effective online learning that is computationally efficient while minimizing the amount of exploration. We take a Bayesian model-based approach, framing RL as a partially observable Markov decision process. Our two main contributions are the analytical derivation that the optimal value function is the upper envelope of a set of multivariate polynomials, and an efficient point-based value iteration algorithm that exploits this simple parameterization.
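
The Bayesian model-based framing behind BEETLE can be sketched minimally: uncertainty about the unknown discrete transition model is kept as a Dirichlet distribution per (state, action) pair, so each observed transition is a conjugate update. The toy sizes and prior below are illustrative; BEETLE's actual contributions (the polynomial parameterization of the value function and point-based value iteration over model-beliefs) go well beyond this bookkeeping.

```python
import numpy as np

# Minimal sketch of the Bayesian model-based view: a Dirichlet posterior
# over the unknown transition model, updated conjugately by counting.
# Sizes and prior are illustrative, not from the paper.

n_states, n_actions = 3, 2
counts = np.full((n_states, n_actions, n_states), 1.0)  # symmetric prior

def observe(s, a, s_next):
    """Conjugate posterior update after observing s --a--> s_next."""
    counts[s, a, s_next] += 1.0

def posterior_mean_model():
    """Expected transition probabilities under the Dirichlet posterior."""
    return counts / counts.sum(axis=2, keepdims=True)

for s, a, s2 in [(0, 0, 1), (0, 0, 1), (0, 0, 2), (1, 1, 0)]:
    observe(s, a, s2)

P = posterior_mean_model()
print(P[0, 0])   # counts [1, 3, 2] over total 6 -> [1/6, 1/2, 1/3]
```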

NeurIPS Conference 2006 Conference Paper

Automated Hierarchy Discovery for Planning in Partially Observable Environments

  • Laurent Charlin
  • Pascal Poupart
  • Romy Shioda

Planning in partially observable domains is a notoriously difficult problem. However, in many real-world scenarios, planning can be simplified by decomposing the task into a hierarchy of smaller planning problems. Several approaches have been proposed to optimize a policy that decomposes according to a hierarchy specified a priori. In this paper, we investigate the problem of automatically discovering the hierarchy. More precisely, we frame the optimization of a hierarchical policy as a non-convex optimization problem that can be solved with general non-linear solvers, a mixed-integer non-linear approximation or a form of bounded hierarchical policy iteration. By encoding the hierarchical structure as variables of the optimization problem, we can automatically discover a hierarchy. Our method is flexible enough to allow any parts of the hierarchy to be specified based on prior knowledge while letting the optimization discover the unknown parts. It can also discover hierarchical policies, including recursive policies, that are more compact (potentially infinitely fewer parameters) and often easier to understand given the decomposition induced by the hierarchy.

AIJ Journal 2006 Journal Article

Constraint-based optimization and utility elicitation using the minimax decision criterion

  • Craig Boutilier
  • Relu Patrascu
  • Pascal Poupart
  • Dale Schuurmans

In many situations, a set of hard constraints encodes the feasible configurations of some system or product over which multiple users have distinct preferences. However, making suitable decisions requires that the preferences of a specific user for different configurations be articulated or elicited, something generally acknowledged to be onerous. We address two problems associated with preference elicitation: computing a best feasible solution when the user's utilities are imprecisely specified; and developing useful elicitation procedures that reduce utility uncertainty, with minimal user interaction, to a point where (approximately) optimal decisions can be made. Our main contributions are threefold. First, we propose the use of minimax regret as a suitable decision criterion for decision making in the presence of such utility function uncertainty. Second, we devise several different procedures, all relying on mixed integer linear programs, that can be used to compute minimax regret and regret-optimizing solutions effectively. In particular, our methods exploit generalized additive structure in a user's utility function to ensure tractable computation. Third, we propose various elicitation methods that can be used to refine utility uncertainty in such a way as to quickly (i.e., with as few questions as possible) reduce minimax regret. Empirical study suggests that several of these methods are quite successful in minimizing the number of user queries, while remaining computationally practical so as to admit real-time user interaction.
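
The minimax-regret criterion described here can be illustrated by brute force on a tiny instance: the paper computes it with mixed integer linear programs, but for a small feasible set under interval uncertainty over linear utility weights, regret is maximized at the box's vertices and can simply be enumerated. The constraint, weights and intervals below are illustrative.

```python
import itertools
import numpy as np

# Brute-force illustration of minimax regret (the paper uses MILPs).
# Configurations are bit-vectors under a hard constraint; utility is
# linear with each weight only known to lie in an interval. The
# adversary's objective max_y y.w - x.w is a max of linear functions of
# w (convex), so it is maximized at a vertex of the interval box.

features = 3
# Hard constraint: at most two features may be selected (illustrative).
feasible = [x for x in itertools.product([0, 1], repeat=features)
            if sum(x) <= 2]

lo = np.array([1.0, 0.0, -1.0])   # lower bounds on utility weights
hi = np.array([2.0, 3.0, 1.0])    # upper bounds on utility weights
vertices = list(itertools.product(*zip(lo, hi)))

def regret(x):
    """Max over utility vertices of (best alternative) - (utility of x)."""
    return max(max(np.dot(y, w) for y in feasible) - np.dot(x, w)
               for w in (np.array(v) for v in vertices))

best = min(feasible, key=regret)   # minimax-regret optimal configuration
print(best, regret(best))
```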

JMLR Journal 2006 Journal Article

Point-Based Value Iteration for Continuous POMDPs

  • Josep M. Porta
  • Nikos Vlassis
  • Matthijs T.J. Spaan
  • Pascal Poupart

We propose a novel approach to optimize Partially Observable Markov Decision Processes (POMDPs) defined on continuous spaces. To date, most algorithms for model-based POMDPs are restricted to discrete states, actions, and observations, but many real-world problems such as, for instance, robot navigation, are naturally defined on continuous spaces. In this work, we demonstrate that the value function for continuous POMDPs is convex in the beliefs over continuous state spaces, and piecewise-linear convex for the particular case of discrete observations and actions but still continuous states. We also demonstrate that continuous Bellman backups are contracting and isotonic ensuring the monotonic convergence of value-iteration algorithms. Relying on those properties, we extend the algorithm, originally developed for discrete POMDPs, to work in continuous state spaces by representing the observation, transition, and reward models using Gaussian mixtures, and the beliefs using Gaussian mixtures or particle sets. With these representations, the integrals that appear in the Bellman backup can be computed in closed form and, therefore, the algorithm is computationally feasible. Finally, we further extend the approach to deal with continuous action and observation sets by designing effective sampling approaches.

IJCAI Conference 2005 Conference Paper

A Decision-Theoretic Approach to Task Assistance for Persons with Dementia

  • Jennifer Boger
  • Pascal Poupart
  • Jesse Hoey
  • Craig Boutilier
  • Geoff Fernie
  • Alex

Cognitive assistive technologies that aid people with dementia (such as Alzheimer’s disease) hold the promise to provide such people with an increased level of independence. However, to realize this promise, such systems must account for the specific needs and preferences of individuals. We argue that this form of customization requires a sequential, decision-theoretic model of interaction. We describe both fully and partially observable Markov decision process (POMDP) models of a handwashing task, and show that, despite the potential computational complexity, these can be effectively solved and produce policies that are evaluated as useful by professional caregivers.

IJCAI Conference 2005 Conference Paper

Regret-based Utility Elicitation in Constraint-based Decision Problems

  • Craig Boutilier
  • Relu Patrascu
  • Pascal Poupart
  • Dale

We propose new methods of preference elicitation for constraint-based optimization problems based on the use of minimax regret. Specifically, we assume a constraint-based optimization problem (e.g., product configuration) in which the objective function (e.g., consumer preferences) is unknown or imprecisely specified. Assuming a graphical utility model, we describe several elicitation strategies that require the user to answer only binary (bound) queries on the utility model parameters. While a theoretically motivated algorithm can provably reduce regret quickly (in terms of number of queries), we demonstrate that, in practice, heuristic strategies perform much better, and are able to find optimal (or near-optimal) configurations with far fewer queries.

NeurIPS Conference 2004 Conference Paper

VDCBPI: an Approximate Scalable Algorithm for Large POMDPs

  • Pascal Poupart
  • Craig Boutilier

Existing algorithms for discrete partially observable Markov decision processes can at best solve problems of a few thousand states due to two important sources of intractability: the curse of dimensionality and the policy space complexity. This paper describes a new algorithm (VDCBPI) that mitigates both sources of intractability by combining the Value Directed Compression (VDC) technique [13] with Bounded Policy Iteration (BPI) [14]. The scalability of VDCBPI is demonstrated on synthetic network management problems with up to 33 million states.

NeurIPS Conference 2003 Conference Paper

Bounded Finite State Controllers

  • Pascal Poupart
  • Craig Boutilier

We describe a new approximation algorithm for solving partially observable MDPs. Our bounded policy iteration approach searches through the space of bounded-size, stochastic finite state controllers, combining several advantages of gradient ascent (efficiency, search through restricted controller space) and policy iteration (less vulnerability to local optima).
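
The policy-evaluation step underlying controller methods like this one can be sketched concretely: the value of a stochastic finite-state controller satisfies a system of equations that is linear in V(n, s) and can be solved exactly. The tiny random POMDP and controller below are illustrative, and the equation used is the standard FSC evaluation, not the improvement LPs of the paper.

```python
import numpy as np

# Evaluating a stochastic finite-state controller (FSC) for a POMDP by
# solving the standard linear system
#   V(n,s) = sum_a P(a|n) [ R(s,a) + g * sum_{s'} T(s,a,s')
#            sum_z O(s',a,z) sum_{n'} eta(n'|n,z) V(n',s') ].
# The random POMDP/controller below are illustrative toys.

rng = np.random.default_rng(0)
S, A, Z, N, g = 3, 2, 2, 2, 0.9

def rand_dist(*shape):
    d = rng.random(shape)
    return d / d.sum(axis=-1, keepdims=True)

T = rand_dist(S, A, S)    # T[s,a,s']: transition probabilities
O = rand_dist(S, A, Z)    # O[s',a,z]: observation probabilities
R = rng.random((S, A))    # R[s,a]: reward
act = rand_dist(N, A)     # act[n,a] = P(a|n): node action strategy
eta = rand_dist(N, Z, N)  # eta[n,z,n']: node observation strategy

# Assemble the (N*S) x (N*S) system (I - g*M) V = b
M = np.zeros((N * S, N * S))
b = np.zeros(N * S)
for n in range(N):
    for s in range(S):
        i = n * S + s
        for a in range(A):
            b[i] += act[n, a] * R[s, a]
            for s2 in range(S):
                for z in range(Z):
                    for n2 in range(N):
                        M[i, n2 * S + s2] += (act[n, a] * T[s, a, s2]
                                              * O[s2, a, z] * eta[n, z, n2])

V = np.linalg.solve(np.eye(N * S) - g * M, b)
# V[n*S + s] is the value of starting node n in state s; the controller's
# value at a belief b0 is max over n of sum_s b0[s] * V[n*S + s].
```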

NeurIPS Conference 2002 Conference Paper

Value-Directed Compression of POMDPs

  • Pascal Poupart
  • Craig Boutilier

We examine the problem of generating state-space compressions of POMDPs in a way that minimally impacts decision quality. We analyze the impact of compressions on decision quality, observing that compressions that allow accurate policy evaluation (prediction of expected future reward) will not affect decision quality. We derive a set of sufficient conditions that ensure accurate prediction in this respect, illustrate interesting mathematical properties these confer on lossless linear compressions, and use these to derive an iterative procedure for finding good linear lossy compressions. We also elaborate on how structured representations of a POMDP can be used to find such compressions.
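
The iterative procedure for linear compressions can be sketched as a Krylov-style iteration: starting from the reward vector, repeatedly apply each conditional transition matrix and keep only images that add a new direction to the basis (dropping near-dependent ones gives a lossy early stopping). The toy 4-state POMDP below is constructed so that a lossless 2-dimensional compression exists; matrices and names are illustrative.

```python
import numpy as np

# Sketch of Krylov-style iteration for a linear value-directed
# compression F: F's columns should span the reward vector and be
# (approximately) invariant under each conditional transition matrix.
# The toy example aggregates 4 states into 2 groups losslessly.

def krylov_compression(R_vecs, T_mats, tol=1e-8):
    """Return F whose orthonormal columns span R_vecs and are invariant
    (up to tol) under each matrix in T_mats."""
    basis = []                       # orthonormal columns found so far
    frontier = list(R_vecs)
    while frontier:
        v = frontier.pop(0)
        for q in basis:              # residual of v against the basis
            v = v - np.dot(q, v) * q
        if np.linalg.norm(v) > tol:  # keep only genuinely new directions
            q = v / np.linalg.norm(v)
            basis.append(q)
            frontier.extend(Tm @ q for Tm in T_mats)
    return np.column_stack(basis)

# 4 states in two aggregable groups {0,1} and {2,3}: reward and group
# dynamics depend only on the group, so 2 dimensions suffice.
Q = np.array([[0.9, 0.1], [0.2, 0.8]])      # group-level dynamics
T = np.kron(Q, np.full((2, 2), 0.5))        # lifted 4-state matrix
R = np.array([1.0, 1.0, 0.0, 0.0])          # group-constant reward

F = krylov_compression([R], [T])
print(F.shape)   # (4, 2): compressed from 4 dimensions to 2
```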

UAI Conference 2001 Conference Paper

Value-Directed Sampling Methods for POMDPs

  • Pascal Poupart
  • Luis E. Ortiz
  • Craig Boutilier

We consider the problem of approximate belief-state monitoring using particle filtering for the purposes of implementing a policy for a partially-observable Markov decision process (POMDP). While particle filtering has become a widely-used tool in AI for monitoring dynamical systems, rather scant attention has been paid to their use in the context of decision making. Assuming the existence of a value function, we derive error bounds on decision quality associated with filtering using importance sampling. We also describe an adaptive procedure that can be used to dynamically determine the number of samples required to meet specific error bounds. Empirical evidence is offered supporting this technique as a profitable means of directing sampling effort where it is needed to distinguish policies.
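
The belief-monitoring setting of this paper can be sketched with a plain bootstrap particle filter on a tiny two-state POMDP: propagate particles through the transition model, importance-weight by the observation likelihood, and resample. The adaptive sample-size procedure and value-directed error bounds of the paper are omitted; the model below is illustrative.

```python
import numpy as np

# Minimal particle-filter belief monitoring for a 2-state POMDP with
# sticky transitions and a noisy observation of the state. Illustrative
# toy; the paper's adaptive, value-directed machinery is not shown.

rng = np.random.default_rng(1)
S = 2
T = np.array([[0.95, 0.05], [0.05, 0.95]])  # T[s,s']: sticky dynamics
O = np.array([[0.8, 0.2], [0.2, 0.8]])      # O[s,z]: noisy observation

def pf_update(particles, z):
    """One step: propagate, importance-weight by Pr(z|s'), resample."""
    propagated = np.array([rng.choice(S, p=T[s]) for s in particles])
    weights = O[propagated, z]
    weights = weights / weights.sum()
    idx = rng.choice(len(propagated), size=len(propagated), p=weights)
    return propagated[idx]

def empirical_belief(particles):
    return np.bincount(particles, minlength=S) / len(particles)

particles = rng.choice(S, size=2000)        # uniform initial belief
for z in [0, 0, 0, 0]:                      # repeatedly observe z = 0
    particles = pf_update(particles, z)

b = empirical_belief(particles)
print(b)   # belief concentrates on state 0 after several z=0 readings
```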

UAI Conference 2001 Conference Paper

Vector-space Analysis of Belief-state Approximation for POMDPs

  • Pascal Poupart
  • Craig Boutilier

We propose a new approach to value-directed belief state approximation for POMDPs. The value-directed model allows one to choose approximation methods for belief state monitoring that have a small impact on decision quality. Using a vector space analysis of the problem, we devise two new search procedures for selecting an approximation scheme that have much better computational properties than existing methods. Though these provide looser error bounds, we show empirically that they have a similar impact on decision quality in practice, and run up to two orders of magnitude more quickly.

UAI Conference 2000 Conference Paper

Value-Directed Belief State Approximation for POMDPs

  • Pascal Poupart
  • Craig Boutilier

We consider the problem of belief-state monitoring for the purposes of implementing a policy for a partially observable Markov decision process (POMDP), specifically how one might approximate the belief state. Other schemes for belief-state approximation (e.g., based on minimizing a measure such as KL-divergence between the true and estimated state) are not necessarily appropriate for POMDPs. Instead we propose a framework for analyzing value-directed approximation schemes, where approximation quality is determined by the expected error in utility rather than by the error in the belief state itself. We propose heuristic methods for finding good projection schemes for belief state estimation - exhibiting anytime characteristics - given a POMDP value function. We also describe several algorithms for constructing bounds on the error in decision quality (expected utility) associated with acting in accordance with a given belief state approximation.
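
The central point of the value-directed view, that two belief approximations with the same raw belief error can differ greatly in expected-utility error, fits in a few lines. The numbers below are illustrative, with a fixed vector V standing in for a POMDP value function.

```python
import numpy as np

# Two belief approximations with identical L1 error but very different
# expected-utility error: what matters is how the belief error aligns
# with the value function. All numbers are illustrative.

V = np.array([10.0, 10.0, 0.0, 5.0])   # stand-in value vector
b = np.array([0.4, 0.1, 0.3, 0.2])     # true belief

b_a = np.array([0.3, 0.2, 0.3, 0.2])   # error along a flat direction of V
b_b = np.array([0.4, 0.1, 0.4, 0.1])   # error along a steep direction of V

def value_error(b_approx):
    """Error in expected utility induced by acting on b_approx."""
    return abs(np.dot(V, b) - np.dot(V, b_approx))

# Both approximations have L1 belief error 0.2, yet b_a induces (near)
# zero utility error while b_b induces an error of 0.5.
print(value_error(b_a))
print(value_error(b_b))
```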