Arrow Research search

Author name cluster

Scott Sanner

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

81 papers
2 author rows

Possible papers

81

AAAI Conference 2026 Conference Paper

Efficient Modality Translation via Arbitrary Conditioning and Wasserstein Regularization

  • Tomas Tokar
  • Scott Sanner

The central challenge in multimodal generative modeling lies in accurately approximating the joint data distribution, even when some modalities are missing. Existing multimodal VAEs solve this by designing increasingly complex encoding architectures, relying on modality-specific encoders, factorized posteriors, and custom inference procedures. This restricts their ability to capture relations among modalities by amortizing the encoding parameters. We challenge this paradigm by introducing a model trained for arbitrary conditioning, i.e., generating any modality given a subset of observed modalities and a logical index indicating which modalities are present or missing. This enables a single unified encoder to handle any subset of modalities while capturing inter-modal relationships via a compact, shared posterior. We find that to work efficiently in the multimodal setup, arbitrary conditioning requires replacing the KL divergence with Wasserstein regularization, which allows more dispersed latent embeddings to support learning over diverse data and modality subsets. This key insight exposes a critical deficiency in existing methods, which rely on KL regularization that tends to concentrate individual embeddings near the standard Gaussian prior, despite coming from very diverse subsets of multimodal inputs. We prove that Wasserstein regularization ensures that the aggregate latent distribution -- spanning all conditioning subsets -- aligns with the prior without requiring mixture models or auxiliary inference tricks. Empirically, the proposed model improves cross-modal generation and yields better reconstructions than state-of-the-art multimodal VAEs.
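The KL-vs-Wasserstein contrast in this abstract has simple closed forms when the posterior is a diagonal Gaussian and the prior is standard normal. A minimal numpy sketch of the two penalties, for illustration only (function names are hypothetical, not the paper's code):

```python
import numpy as np

def kl_to_std_normal(mu, sigma):
    # KL( N(mu, diag(sigma^2)) || N(0, I) ), closed form
    return 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - 2.0 * np.log(sigma))

def w2sq_to_std_normal(mu, sigma):
    # Squared 2-Wasserstein distance between N(mu, diag(sigma^2)) and N(0, I):
    # ||mu||^2 + sum_i (sigma_i - 1)^2 for diagonal Gaussians
    return np.sum(mu**2) + np.sum((sigma - 1.0) ** 2)
```

These closed forms only make the two regularizers concrete; the paper's actual argument concerns how each shapes the aggregate latent distribution across conditioning subsets.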

AAAI Conference 2026 Conference Paper

Near-optimal Linear Predictive Clustering in Non-separable Spaces via MIP and QPBO Reductions

  • Jiazhou Liang
  • Hassan Khurram
  • Scott Sanner

Linear Predictive Clustering (LPC) partitions samples based on shared linear relationships between feature and target variables, with numerous applications including marketing, medicine, and education. Greedy optimization methods, commonly used for LPC, alternate between clustering and linear regression but lack global optimality. While effective for separable clusters, they struggle in non-separable settings where clusters overlap in feature space. In an alternative constrained optimization paradigm, previous works formulated LPC as a Mixed-Integer Program (MIP), ensuring global optimality regardless of separability but at the cost of poor scalability. This work builds on the constrained optimization paradigm to introduce two novel approaches that improve the efficiency of global optimization for LPC. By leveraging key theoretical properties of separability, we derive near-optimal approximations with provable error bounds, significantly reducing the MIP formulation’s complexity and improving scalability. Additionally, we can further approximate LPC as a Quadratic Pseudo-Boolean Optimization (QPBO) problem, achieving additional computational gains in the special case of two clusters. Comparative analyses on synthetic and real-world datasets demonstrate that our methods consistently achieve near-optimal solutions with substantially lower regression errors than greedy optimization while exhibiting superior scalability over existing MIP formulations.
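The greedy alternating baseline that this abstract contrasts with its MIP approach can be sketched in a few lines of numpy. This is a hypothetical illustration of the alternating scheme, not the paper's implementation:

```python
import numpy as np

def fit_lpc_greedy(X, y, k=2, iters=50, seed=0):
    """Greedy Linear Predictive Clustering: alternate (1) assigning each
    sample to the cluster whose linear model predicts it best and
    (2) refitting each cluster's least-squares model. Converges to a
    local optimum only, which is the limitation global MIP methods remove."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    Xb = np.hstack([X, np.ones((n, 1))])       # add intercept column
    z = rng.integers(k, size=n)                # random initial assignment
    W = np.zeros((k, d + 1))
    for _ in range(iters):
        for c in range(k):                     # refit per-cluster regressions
            idx = z == c
            if idx.sum() > d:
                W[c], *_ = np.linalg.lstsq(Xb[idx], y[idx], rcond=None)
        resid = (Xb @ W.T - y[:, None]) ** 2   # n x k squared prediction errors
        z_new = resid.argmin(axis=1)           # greedy reassignment
        if np.array_equal(z_new, z):
            break
        z = z_new
    return z, W

# Two clusters overlapping in feature space but following different linear laws
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 1))
z_true = rng.integers(2, size=200)
y = np.where(z_true == 0, 2 * X[:, 0] + 1, -2 * X[:, 0] - 1)
z, W = fit_lpc_greedy(X, y)
```

Because assignment and refitting each only decrease the local objective, the loop terminates, but nothing guarantees the global optimum in such non-separable settings.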

NeurIPS Conference 2025 Conference Paper

ActiveVOO: Value of Observation Guided Active Knowledge Acquisition for Open-World Embodied Lifted Regression Planning

  • Xiaotian Liu
  • Ali Pesaranghader
  • Jaehong Kim
  • Tanmana Sadhu
  • Hyejeong Jeon
  • Scott Sanner

The ability to actively acquire information is essential for open-world planning under partial observability and incomplete knowledge. However, most existing embodied AI systems either assume a known object category or rely on passive perception strategies that exhaustively gather object and relational information from the environment. Such a strategy becomes insufficient in visually complex open-world settings. For instance, a typical household may contain thousands of novel and uniquely configured objects, most of which are irrelevant to the agent’s current task. Consequently, open-world agents must be capable of actively identifying and prioritizing task-relevant objects to enable efficient and goal-directed knowledge acquisition. In this work, we introduce ActiveVOO, a novel zero-shot framework for open-world embodied planning that emphasizes object-centric active knowledge acquisition. ActiveVOO employs lifted regression to generate compact, first-order subgoal descriptions that identify task-relevant objects, and provides a principled mechanism to quantify the utility of sensing actions based on commonsense priors derived from LLMs and VLMs. We evaluate ActiveVOO on the visual ALFWorld benchmark, where it achieves substantial improvements over existing LLM- and VLM-based planning approaches, notably outperforming VLMs fine-tuned on ALFWorld data. This work establishes a principled foundation for developing embodied agents capable of actively and efficiently acquiring knowledge to plan and act in open-world environments.

AAAI Conference 2025 Conference Paper

ICE-T: Interactions-aware Cross-column Contrastive Embedding for Heterogeneous Tabular Datasets

  • Tomas Tokar
  • Scott Sanner

Finding high-quality representations of heterogeneous tabular datasets is crucial for their effective use in downstream machine learning tasks. Contrastive representation learning (CRL) methods have been previously shown to provide a straightforward way to learn such representations across various data domains. Current tabular CRL methods learn joint embeddings of data instances (tabular rows) by minimizing a contrastive loss between the original instance and its perturbations. Unlike existing tabular CRL methods, we propose leveraging frameworks established in multimodal representation learning, treating each tabular column as a distinct modality. A naive approach that applies a contrastive loss pairwise to tabular columns is not only prohibitively expensive as the number of columns increases, but as we demonstrate, it also fails to capture interactions between variables. Instead, we propose a novel method called ICE-T that learns each columnar embedding by contrasting it with aggregate embeddings of the complementary part of the table, thus capturing interactions and scaling linearly with the number of columns. Unlike existing tabular CRL methods, ICE-T allows for column-specific embeddings to be obtained independently of the rest of the table, enabling the inference of missing values and translation between columnar variables. We provide a comprehensive evaluation of ICE-T across diverse datasets, demonstrating that it generally surpasses the performance of the state-of-the-art alternatives.
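The column-vs-rest contrastive objective described here can be made concrete with a standard InfoNCE-style loss, where each row's column embedding is contrasted against the aggregate embedding of its remaining columns. A hypothetical sketch (names and details are illustrative, not the paper's code):

```python
import numpy as np

def info_nce(col_emb, rest_emb, tau=0.1):
    """Contrast each row's column embedding against the aggregate embedding
    of the complementary columns (its positive); embeddings from other rows
    in the batch act as negatives."""
    a = col_emb / np.linalg.norm(col_emb, axis=1, keepdims=True)
    b = rest_emb / np.linalg.norm(rest_emb, axis=1, keepdims=True)
    logits = a @ b.T / tau                         # batch x batch similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))            # positives on the diagonal
```

One such loss per column gives the linear scaling in column count that the abstract highlights, versus the quadratic cost of contrasting all column pairs.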

ICLR Conference 2025 Conference Paper

LLM-based Typed Hyperresolution for Commonsense Reasoning with Knowledge Bases

  • Armin Toroghi
  • Ali Pesaranghader
  • Tanmana Sadhu
  • Scott Sanner

Large language models (LLMs) are being increasingly applied to tasks requiring commonsense reasoning. Despite their outstanding potential, the reasoning process of LLMs is prone to errors and hallucinations that hinder their applicability, especially in high-stakes scenarios. Several works have attempted to enhance commonsense reasoning performance of LLMs by (i) using prompting styles that elicit more accurate reasoning, (ii) utilizing the LLM as a semantic parser for a symbolic reasoner, or (iii) forcing the LLM to simulate a logical inference rule. However, all these solutions have critical limitations: they are unable to leverage the internal commonsense knowledge of the LLM in tandem with an axiomatic knowledge base, they lack a mechanism to reliably repair erroneous inference steps, and their application is restricted to small knowledge bases that fit the context limit of the LLM. In this work, we present LLM-based Typed Hyperresolution (LLM-TH), a logical commonsense reasoning framework that leverages "theory resolution", a concept from classical logical inference which enables integrating LLMs into the "resolution" inference rule, thus mitigating reasoning errors and hallucinations and enabling verification of the reasoning procedure. LLM-TH is also equipped with a mechanism for repairing erroneous inference steps supported by theoretical guarantees. Using "Hyperresolution" and "Typed inference" schemes, we show that LLM-TH can efficiently reason over large knowledge bases consisting of tens of thousands of rules with arbitrary predicate arities. Our experiments on three diverse language-based reasoning tasks—preference reasoning, multi-domain deductive reasoning, and geographical question answering—showcase that LLM-TH, using merely a BART 406M parameter NLI entailment model, significantly reduces reasoning errors compared to baselines using Llama3-70B, Gemini1.5-Flash, GPT-3.5-Turbo, and Mixtral-46.7B.

AAAI Conference 2025 Conference Paper

ModelDiff: Symbolic Dynamic Programming for Model-Aware Policy Transfer in Deep Q-Learning

  • Xiaotian Liu
  • Jihwan Jeong
  • Ayal Taitler
  • Michael Gimelfarb
  • Scott Sanner

Despite significant recent advances in the field of Deep Reinforcement Learning (DRL), such methods typically incur high cost of training to learn effective policies, thus posing cost and safety challenges in many practical applications. To improve the learning efficiency of (D)RL methods, transfer learning (TL) has emerged as a promising approach to leverage prior experience on a source domain to speed learning on a new, but related, target domain. In this paper, we take a novel model-informed approach to TL in DRL by assuming that we have knowledge of both the source and target domain models (which would be the case in the prevalent setting of DRL with simulators). While directly solving either the source or target MDP via solution methods like value iteration is computationally prohibitive, we exploit the fact that if the target and source MDPs differ only due to a small structural change in their rewards, we can apply structured value iteration methods in a procedure we term ModelDiff to solve the much smaller target-source "Diff" MDP for a reasonable horizon. This ModelDiff approach can then be integrated into extensions of standard DRL algorithms like ModelDiff (MD) DQN, where it provides enhanced provable lower bound guidance to DQN that often speeds convergence for the positive transfer case while critically avoiding decelerated learning in the negative transfer case. Experiments show that MD-DQN matches or outperforms existing TL methods and baselines in both positive and negative transfer settings.

ICML Conference 2025 Conference Paper

Reflect-then-Plan: Offline Model-Based Planning through a Doubly Bayesian Lens

  • Jihwan Jeong
  • Xiaoyu Wang 0018
  • Jingmin Wang
  • Scott Sanner
  • Pascal Poupart

Offline reinforcement learning (RL) is crucial when online exploration is costly or unsafe but often struggles with high epistemic uncertainty due to limited data. Existing methods rely on fixed conservative policies, restricting adaptivity and generalization. To address this, we propose Reflect-then-Plan (RefPlan), a novel doubly Bayesian offline model-based (MB) planning approach. RefPlan unifies uncertainty modeling and MB planning by recasting planning as Bayesian posterior estimation. At deployment, it updates a belief over environment dynamics using real-time observations, incorporating uncertainty into MB planning via marginalization. Empirical results on standard benchmarks show that RefPlan significantly improves the performance of conservative offline RL policies. In particular, RefPlan maintains robust performance under high epistemic uncertainty and limited data, while demonstrating resilience to changing environment dynamics, improving the flexibility, generalizability, and robustness of offline-learned policies.

ICML Conference 2025 Conference Paper

Self-Supervised Transformers as Iterative Solution Improvers for Constraint Satisfaction

  • Yudong Xu 0001
  • Wenhao Li
  • Scott Sanner
  • Elias B. Khalil

We present a Transformer-based framework for Constraint Satisfaction Problems (CSPs). CSPs find use in many applications and thus accelerating their solution with machine learning is of wide interest. Most existing approaches rely on supervised learning from feasible solutions or reinforcement learning, paradigms that require either feasible solutions to these NP-Complete CSPs or large training budgets and a complex expert-designed reward signal. To address these challenges, we propose ConsFormer, a self-supervised framework that leverages a Transformer as a solution refiner. ConsFormer constructs a solution to a CSP iteratively in a process that mimics local search. Instead of using feasible solutions as labeled data, we devise differentiable approximations to the discrete constraints of a CSP to guide model training. Our model is trained to improve random assignments for a single step but is deployed iteratively at test time, circumventing the bottlenecks of supervised and reinforcement learning. Experiments on Sudoku, Graph Coloring, Nurse Rostering, and MAXCUT demonstrate that our method can tackle out-of-distribution CSPs simply through additional iterations.
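One way to picture the "differentiable approximations to the discrete constraints" mentioned here: relax an all-different constraint (as in Sudoku rows) into a penalty on the softmax value distributions of the cells. This is a sketch of the idea, not ConsFormer's exact loss:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def all_different_penalty(logits):
    """Differentiable relaxation of an all-different constraint over one
    group of cells: with P[i, v] the softmax probability that cell i takes
    value v, penalize expected pairwise agreement sum_{i<j} <P_i, P_j>.
    The penalty vanishes when the rows of P are one-hot and all distinct."""
    P = softmax(logits)                   # cells x values
    G = P @ P.T                           # pairwise agreement probabilities
    return (G.sum() - np.trace(G)) / 2.0  # sum over unordered pairs
```

A penalty of this shape is smooth in the logits, so it can supervise a Transformer's single-step refinement without any feasible solutions as labels.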

TMLR Journal 2025 Journal Article

Tackling the Abstraction and Reasoning Corpus with Vision Transformers: the Importance of 2D Representation, Positions, and Objects

  • Wenhao Li
  • Yudong Xu
  • Scott Sanner
  • Elias Boutros Khalil

The Abstraction and Reasoning Corpus (ARC) is a popular benchmark focused on visual reasoning in the evaluation of Artificial Intelligence systems. In its original framing, an ARC task requires solving a program synthesis problem over small 2D images using a few input-output training pairs. In this work, we adopt the recently popular data-driven approach to the ARC and ask whether a Vision Transformer (ViT) can learn the implicit mapping, from input image to output image, that underlies the task. We show that a ViT—otherwise a state-of-the-art model for images—fails dramatically on most ARC tasks even when trained on one million examples per task. This points to an inherent representational deficiency of the ViT architecture that makes it incapable of uncovering the simple structured mappings underlying the ARC tasks. Building on these insights, we propose VITARC, a ViT-style architecture that unlocks some of the visual reasoning capabilities required by the ARC. Specifically, we use a pixel-level input representation, design a spatially-aware tokenization scheme, and introduce a novel object-based positional encoding that leverages automatic segmentation, among other enhancements. Our task-specific VITARC models achieve a test solve rate close to 100% on more than half of the 400 public ARC tasks strictly through supervised learning from input-output grids. This calls attention to the importance of imbuing the powerful (Vision) Transformer with the correct inductive biases for abstract visual reasoning that are critical even when the training data is plentiful and the mapping is noise-free. Hence, VITARC provides a strong foundation for future research in visual reasoning using transformer-based architectures.

AAAI Conference 2024 Conference Paper

Bayesian Inference with Complex Knowledge Graph Evidence

  • Armin Toroghi
  • Scott Sanner

Knowledge Graphs (KGs) provide a widely used format for representing entities and their relationships and have found use in diverse applications including question answering and recommendation. A majority of current research on KG inference has focused on reasoning with atomic facts (triples) and has disregarded the possibility of making complex evidential observations involving logical operators (negation, conjunction, disjunction) and quantifiers (existential, universal). Further, while the application of complex evidence has been explored in KG-based query answering (KGQA) research, in many practical online settings, observations are made sequentially. For example, in KGQA, additional context may be incrementally suggested to narrow down the answer. Or in interactive recommendation, user critiques may be expressed sequentially in order to narrow down a set of preferred items. Both settings are indicative of information filtering or tracking tasks that are reminiscent of belief tracking in Bayesian inference. In fact, in this paper, we precisely cast the problem of belief tracking over unknown KG entities given incremental complex KG evidence as a Bayesian filtering problem. Specifically, we leverage Knowledge-based Model Construction (KBMC) over the logical KG evidence to instantiate a Markov Random Field (MRF) likelihood representation to perform closed-form Bayesian inference with complex KG evidence (BIKG). We experimentally evaluate BIKG in incremental KGQA and interactive recommendation tasks demonstrating that it outperforms non-incremental methodologies and leads to better incorporation of conjunctive evidence vs. existing complex KGQA methods like CQD that leverage fuzzy T-norm operators. Overall, this work demonstrates a novel, efficient, and unified perspective of logic, KGs, and online inference through the lens of closed-form BIKG.
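The belief-tracking framing in this abstract reduces, in its simplest form, to sequential Bayesian conditioning over candidate entities. A toy illustration of that view (the paper's actual construction uses an MRF likelihood over logical KG evidence):

```python
import numpy as np

def bayes_filter(prior, likelihoods):
    """Sequentially condition a belief over candidate entities on a stream
    of evidence observations: posterior ∝ prior × likelihood, renormalized
    after each observation."""
    belief = np.asarray(prior, dtype=float)
    for lik in likelihoods:              # lik[e] = P(observation | entity e)
        belief = belief * np.asarray(lik, dtype=float)
        belief = belief / belief.sum()   # renormalize
    return belief

# Three candidate answers; two observations progressively narrow them down
prior = [1/3, 1/3, 1/3]
evidence = [[0.9, 0.9, 0.1],   # first observation nearly rules out entity 2
            [0.8, 0.2, 0.5]]   # second observation favours entity 0
posterior = bayes_filter(prior, evidence)
```

Each incremental piece of KGQA context or recommendation critique plays the role of one likelihood vector in this loop.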

ICAPS Conference 2024 Conference Paper

JaxPlan and GurobiPlan: Optimization Baselines for Replanning in Discrete and Mixed Discrete-Continuous Probabilistic Domains

  • Michael Gimelfarb
  • Ayal Taitler
  • Scott Sanner

Replanning methods that determinize a stochastic planning problem and replan at each action step have long been known to provide strong baseline (and even competition winning) solutions to discrete probabilistic planning problems. Recent work has explored the extension of replanning methods to the case of mixed discrete-continuous probabilistic domains by leveraging MILP compilations of the RDDL specification language. Other recent advances in probabilistic planning have explored the compilation of structured mixed discrete-continuous RDDL domains into a determinized computation graph that also lends itself to replanning via so-called planning by backpropagation methods. However, to date, there has not been any comprehensive comparison of these recent optimization-based replanning methodologies to the state-of-the-art winner of the discrete probabilistic IPC 2011 and 2014 and runner-up in 2018 (PROST) and the winner of the mixed discrete-continuous probabilistic IPC 2023 (DiSProd). In this paper, we describe JaxPlan, which makes several extensive upgrades to planning by backpropagation and its compact tensorized compilation from RDDL to a JAX computation graph that uses discrete relaxations and a sample average approximation. We also provide the first detailed overview of a compilation of the RDDL language specification to Gurobi

TMLR Journal 2024 Journal Article

LLMs and the Abstraction and Reasoning Corpus: Successes, Failures, and the Importance of Object-based Representations

  • Yudong Xu
  • Wenhao Li
  • Pashootan Vaezipoor
  • Scott Sanner
  • Elias Boutros Khalil

Can a Large Language Model (LLM) solve simple abstract reasoning problems? We explore this broad question through a systematic analysis of GPT on the Abstraction and Reasoning Corpus (ARC), a representative benchmark of abstract reasoning ability from limited examples in which solutions require some "core knowledge" of concepts such as objects, goal states, counting, and basic geometry. GPT-4 solves only 13/50 of the most straightforward ARC tasks when using textual encodings for their two-dimensional input-output grids. Our failure analysis reveals that GPT-4's capacity to identify objects and reason about them is significantly influenced by the sequential nature of the text that represents an object within a text encoding of a task. To test this hypothesis, we design a new benchmark, the 1D-ARC, which consists of one-dimensional (array-like) tasks that are more conducive to GPT-based reasoning, and where it indeed performs better than on the (2D) ARC. To alleviate this issue, we propose an object-based representation that is obtained through an external tool, resulting in nearly doubling the performance on solved ARC tasks and near-perfect scores on the easier 1D-ARC. Although the state-of-the-art GPT-4 is unable to "reason" perfectly within non-language domains such as the 1D-ARC or a simple ARC subset, our study reveals that the use of object-based representations can significantly improve its reasoning ability. Visualizations, GPT logs, and data are available at https://khalil-research.github.io/LLM4ARC.

PRL Workshop 2024 Workshop Paper

ModelDiff: Leveraging Models for Policy Transfer with Value Lower Bounds

  • Xiaotian Liu
  • Jihwan Jeong
  • Ayal Taitler
  • Michael Gimelfarb
  • Scott Sanner

Despite significant recent advances in the field of Deep Reinforcement Learning (DRL), such methods typically incur high cost of training to learn effective policies, thus posing cost and safety challenges in many practical applications. To improve the learning efficiency of (D)RL methods, transfer learning (TL) has emerged as a promising approach to leverage prior experience on a source domain to speed learning on a new, but related, target domain. In this paper, we take a novel model-informed approach to TL in DRL by assuming that we have knowledge of both the source and target domain models (which would be the case in the prevalent setting of DRL with simulators). While directly solving either the source or target MDP via solution methods like value iteration is computationally prohibitive, we exploit the fact that if the target and source MDPs differ only due to a small structural change in their rewards, we can apply structured value iteration methods in a procedure we term ModelDiff to solve the much smaller target-source "Diff" MDP for a reasonable horizon. This ModelDiff approach can then be integrated into extensions of standard DRL algorithms like lower bound (LB) DQN where it provides enhanced provable LB guidance to DQN that speeds convergence. Experiments show that our ModelDiff LB-DQN matches or outperforms existing TL methods and baselines in both positive and negative transfer settings.

ICLR Conference 2023 Conference Paper

Conservative Bayesian Model-Based Value Expansion for Offline Policy Optimization

  • Jihwan Jeong
  • Xiaoyu Wang 0018
  • Michael Gimelfarb
  • Hyunwoo Kim
  • Baher Abdulhai
  • Scott Sanner

Offline reinforcement learning (RL) addresses the problem of learning a performant policy from a fixed batch of data collected by following some behavior policy. Model-based approaches are particularly appealing in the offline setting since they can extract more learning signals from the logged dataset by learning a model of the environment. However, the performance of existing model-based approaches falls short of model-free counterparts, due to the compounding of estimation errors in the learned model. Driven by this observation, we argue that it is critical for a model-based method to understand when to trust the model and when to rely on model-free estimates, and how to act conservatively w.r.t. both. To this end, we derive an elegant and simple methodology called conservative Bayesian model-based value expansion for offline policy optimization (CBOP), that trades off model-free and model-based estimates during the policy evaluation step according to their epistemic uncertainties, and facilitates conservatism by taking a lower bound on the Bayesian posterior value estimate. On the standard D4RL continuous control tasks, we find that our method significantly outperforms previous model-based approaches: e.g., MOPO by 116.4%, MOReL by 23.2% and COMBO by 23.7%. Further, CBOP achieves state-of-the-art performance on 11 out of 18 benchmark datasets while performing on par on the remaining datasets.
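The trade-off this abstract describes, weighting model-based and model-free value estimates by their epistemic uncertainty and then acting conservatively, can be sketched with standard precision weighting. A schematic illustration only, not CBOP's exact derivation:

```python
import numpy as np

def conservative_value(estimates, variances, kappa=1.0):
    """Combine several value estimates (e.g. h-step model-based rollouts and
    a model-free bootstrap) by inverse-variance weighting, then take a
    conservative lower bound of the resulting posterior: mean - kappa * std.
    Low-variance (trusted) estimates dominate the combination."""
    prec = 1.0 / np.asarray(variances, dtype=float)
    mean = np.sum(prec * np.asarray(estimates, dtype=float)) / prec.sum()
    var = 1.0 / prec.sum()               # posterior variance of the mean
    return mean - kappa * np.sqrt(var)
```

When the learned model's rollouts have high variance, their weight shrinks automatically, which is the "when to trust the model" behavior the abstract argues for.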

AAAI Conference 2023 Conference Paper

Graphs, Constraints, and Search for the Abstraction and Reasoning Corpus

  • Yudong Xu
  • Elias B. Khalil
  • Scott Sanner

The Abstraction and Reasoning Corpus (ARC) aims at benchmarking the performance of general artificial intelligence algorithms. The ARC's focus on broad generalization and few-shot learning has made it difficult to solve using pure machine learning. A more promising approach has been to perform program synthesis within an appropriately designed Domain Specific Language (DSL). However, these too have seen limited success. We propose Abstract Reasoning with Graph Abstractions (ARGA), a new object-centric framework that first represents images using graphs and then performs a search for a correct program in a DSL that is based on the abstracted graph space. The complexity of this combinatorial search is tamed through the use of constraint acquisition, state hashing, and Tabu search. An extensive set of experiments demonstrates the promise of ARGA in tackling some of the complicated object-centric tasks of the ARC rather efficiently, producing programs that are correct and easy to understand.

PRL Workshop 2023 Workshop Paper

pyRDDLGym: From RDDL to Gym Environments

  • Ayal Taitler
  • Michael Gimelfarb
  • Jihwan Jeong
  • Sriram Gopalakrishnan
  • Martin Mladenov
  • Xiaotian Liu
  • Scott Sanner

We present pyRDDLGym, a Python framework for the auto-generation of OpenAI Gym environments from RDDL declarative descriptions. The discrete time step evolution of variables in RDDL is described by conditional probability functions, which fit naturally into the Gym step scheme. Furthermore, since RDDL is a lifted description, the modification and scaling up of environments to support multiple entities and different configurations becomes trivial rather than a tedious process prone to errors. We hope that pyRDDLGym will bring fresh momentum to the reinforcement learning community by enabling easy and rapid development of benchmarks due to the unique expressive power of RDDL. By providing explicit access to the model in the RDDL description, pyRDDLGym can also facilitate research on hybrid approaches to learning from interaction while leveraging model knowledge. We present the design and built-in examples of pyRDDLGym, and the additions made to the RDDL language that were incorporated into the framework.

ICAPS Conference 2023 Conference Paper

Safe MDP Planning by Learning Temporal Patterns of Undesirable Trajectories and Averting Negative Side Effects

  • Siow Meng Low
  • Akshat Kumar
  • Scott Sanner

In safe MDP planning, a cost function based on the current state and action is often used to specify safety aspects. In the real world, the state representation used often lacks sufficient fidelity to specify such safety constraints. Operating based on an incomplete model can often produce unintended negative side effects (NSEs). To address these challenges, first, we associate safety signals with state-action trajectories (rather than just immediate state-action). This makes our safety model highly general. We also assume categorical safety labels are given for different trajectories, rather than a numerical cost function, which is harder to specify by the problem designer. We then employ a supervised learning model to learn such non-Markovian safety patterns. Second, we develop a Lagrange multiplier method, which incorporates the safety model and the underlying MDP model in a single computation graph to facilitate agent learning of safe behaviors. Finally, our empirical results on a variety of discrete and continuous domains show that this approach can satisfy complex non-Markovian safety constraints while optimizing agent

AAAI Conference 2023 Conference Paper

Scalable and Globally Optimal Generalized L₁ K-center Clustering via Constraint Generation in Mixed Integer Linear Programming

  • Aravinth Chembu
  • Scott Sanner
  • Hassan Khurram
  • Akshat Kumar

The k-center clustering algorithm, introduced over 35 years ago, is known to be robust to class imbalance prevalent in many clustering problems and has various applications such as data summarization, document clustering, and facility location determination. Unfortunately, existing k-center algorithms provide highly suboptimal solutions that can limit their practical application, reproducibility, and clustering quality. In this paper, we provide a novel scalable and globally optimal solution to a popular variant of the k-center problem known as generalized L₁ k-center clustering that uses L₁ distance and allows the selection of arbitrary vectors as cluster centers. We show that this clustering objective can be reduced to a mixed-integer linear program (MILP) that facilitates globally optimal clustering solutions. However, solving such a MILP may be intractable for large datasets; to remedy this, we present a scalable algorithm that leverages constraint generation to efficiently and provably converge to its global optimum. We further enhance outlier handling through a simple but elegant extension to our MILP objective. We first evaluate our algorithm on a variety of synthetic datasets to better understand its properties and then validate on 20 real benchmark datasets where we compare its performance to both traditional L₁ distance k-center and k-medians baselines. Our results demonstrate significant suboptimality of existing algorithms in comparison to our approach and further demonstrate that we can find optimal generalized L₁ k-center clustering solutions up to an unprecedented 1,000,000 data points.
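The "highly suboptimal" heuristics this abstract improves upon include Gonzalez-style farthest-point k-center. A sketch of that classic baseline under L₁ distance (centers restricted to data points here, unlike the generalized variant, which allows arbitrary center vectors):

```python
import numpy as np

def greedy_kcenter_l1(X, k, seed=0):
    """Gonzalez farthest-point heuristic for k-center under L1 distance:
    repeatedly promote the point farthest from its nearest center.
    A classic 2-approximation, not the paper's globally optimal MILP."""
    rng = np.random.default_rng(seed)
    centers = [int(rng.integers(len(X)))]
    d = np.abs(X - X[centers[0]]).sum(axis=1)   # L1 distance to nearest center
    for _ in range(k - 1):
        nxt = int(d.argmax())                   # farthest point becomes a center
        centers.append(nxt)
        d = np.minimum(d, np.abs(X - X[nxt]).sum(axis=1))
    return centers, d.max()                     # center indices and L1 radius
```

The MILP approach in the paper instead certifies the optimal radius, and its constraint-generation loop avoids instantiating all assignment constraints at once.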

AAAI Conference 2022 Conference Paper

A Distributional Framework for Risk-Sensitive End-to-End Planning in Continuous MDPs

  • Noah Patton
  • Jihwan Jeong
  • Mike Gimelfarb
  • Scott Sanner

Recent advances in efficient planning in deterministic or stochastic high-dimensional domains with continuous action spaces leverage backpropagation through a model of the environment to directly optimize action sequences. However, existing methods typically do not take risk into account when optimizing in stochastic domains, which can be incorporated efficiently in MDPs by optimizing a nonlinear utility function of the return distribution. We bridge this gap by introducing Risk-Aware Planning using PyTorch (RAPTOR), a novel unified framework for risk-sensitive planning through end-to-end optimization of commonly-studied risk-sensitive utility functions such as entropic utility, mean-variance optimization and CVaR. A key technical difficulty of our approach is that direct optimization of general risk-sensitive utility functions by backpropagation is impossible due to the presence of environment stochasticity. The novelty of RAPTOR lies in leveraging reparameterization of the state distribution, leading to a unique distributional perspective of end-to-end planning where the return distribution is utilized for sampling as well as optimizing risk-aware objectives by backpropagation in a unified framework. We evaluate and compare RAPTOR on three highly stochastic MDPs, including nonlinear navigation, HVAC control, and linear reservoir control, demonstrating the ability of RAPTOR to manage risk in complex continuous domains according to different notions of risk-sensitive utility.
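Of the utilities listed in this abstract, CVaR is the easiest to make concrete from sampled returns. A standard sampling estimator, shown only to fix ideas (not RAPTOR's implementation, which optimizes such objectives end-to-end by backpropagation):

```python
import numpy as np

def cvar(returns, alpha=0.1):
    """Conditional Value-at-Risk at level alpha: the mean of the worst
    alpha-fraction of sampled returns. Lower alpha = more risk-averse."""
    r = np.sort(np.asarray(returns, dtype=float))
    k = max(1, int(np.ceil(alpha * r.size)))  # size of the worst tail
    return r[:k].mean()
```

Because this estimator is built from the sampled return distribution, reparameterizing the state distribution (as the abstract describes) is what makes gradients flow through it.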

ICML Conference 2022 Conference Paper

An Exact Symbolic Reduction of Linear Smart Predict+Optimize to Mixed Integer Linear Programming

  • Jihwan Jeong
  • Parth Jaggi
  • Andrew Butler
  • Scott Sanner

Predictive models are traditionally optimized independently of their use in downstream decision-based optimization. The ‘smart, predict then optimize’ (SPO) framework addresses this shortcoming by optimizing predictive models in order to minimize the final downstream decision loss. To date, several local first-order methods and convex approximations have been proposed. These methods have proven to be effective in practice; however, it remains generally unclear how close these local solutions are to global optimality. In this paper, we cast the SPO problem as a bi-level program and apply Symbolic Variable Elimination (SVE) to analytically solve the lower-level optimization. The resulting program can then be formulated as a mixed-integer linear program (MILP), which is solved to global optimality using standard off-the-shelf solvers. To our knowledge, our framework is the first to provide a globally optimal solution to the linear SPO problem. Experimental results comparing with state-of-the-art local SPO solvers show that the globally optimal solution obtains up to two orders of magnitude reduction in decision regret.

NeurIPS Conference 2022 Conference Paper

Learning to Follow Instructions in Text-Based Games

  • Mathieu Tuli
  • Andrew Li
  • Pashootan Vaezipoor
  • Toryn Klassen
  • Scott Sanner
  • Sheila McIlraith

Text-based games present a unique class of sequential decision-making problems in which agents interact with a partially observable, simulated environment via actions and observations conveyed through natural language. Such observations typically include instructions that, in a reinforcement learning (RL) setting, can directly or indirectly guide a player towards completing reward-worthy tasks. In this work, we study the ability of RL agents to follow such instructions. We conduct experiments that show that the performance of state-of-the-art text-based game agents is largely unaffected by the presence or absence of such instructions, and that these agents are typically unable to execute tasks to completion. To further study and address the task of instruction following, we equip RL agents with an internal structured representation of natural language instructions in the form of Linear Temporal Logic (LTL), a formal language that is increasingly used for temporally extended reward specification in RL. Our framework supports and highlights the benefit of understanding the temporal semantics of instructions and of measuring progress towards the achievement of such temporally extended behaviour. Experiments with 500+ games in TextWorld demonstrate the superior performance of our approach.

AAAI Conference 2022 Conference Paper

Sample-Efficient Iterative Lower Bound Optimization of Deep Reactive Policies for Planning in Continuous MDPs

  • Siow Meng Low
  • Akshat Kumar
  • Scott Sanner

Recent advances in deep learning have enabled optimization of deep reactive policies (DRPs) for continuous MDP planning by encoding a parametric policy as a deep neural network and exploiting automatic differentiation in an end-to-end model-based gradient descent framework. This approach has proven effective for optimizing DRPs in nonlinear continuous MDPs, but it requires a large number of sampled trajectories to learn effectively and can suffer from high variance in solution quality. In this work, we revisit the overall model-based DRP objective and instead take a minorization-maximization perspective to iteratively optimize the DRP w.r.t. a locally tight lower-bounded objective. This novel formulation of DRP learning as iterative lower bound optimization (ILBO) is particularly appealing because (i) each step is structurally easier to optimize than the overall objective, (ii) it guarantees a monotonically improving objective under certain theoretical conditions, and (iii) it reuses samples between iterations, thus lowering sample complexity. Empirical evaluation confirms that ILBO is significantly more sample-efficient than the state-of-the-art DRP planner and consistently produces better solution quality with lower variance. We additionally demonstrate that ILBO generalizes well to new problem instances (i.e., different initial states) without requiring retraining.

IJCAI Conference 2021 Conference Paper

Bayesian Experience Reuse for Learning from Multiple Demonstrators

  • Mike Gimelfarb
  • Scott Sanner
  • Chi-Guhn Lee

Learning from Demonstrations (LfD) is a powerful approach for incorporating advice from experts in the form of demonstrations. However, demonstrations often come from multiple sub-optimal experts with conflicting goals, rendering them difficult to incorporate effectively in online settings. To address this, we formulate a quadratic program whose solution yields an adaptive weighting over experts, that can be used to sample experts with relevant goals. In order to compare different source and target task goals safely, we model their uncertainty using normal-inverse-gamma priors, whose posteriors are learned from demonstrations using Bayesian neural networks with a shared encoder. Our resulting approach, which we call Bayesian Experience Reuse, can be applied for LfD in static and dynamic decision-making settings. We demonstrate its effectiveness for minimizing multi-modal functions, and optimizing a high-dimensional supply chain with cost uncertainty, where it is also shown to improve upon the performance of the demonstrators' policies.

UAI Conference 2021 Conference Paper

Contextual policy transfer in reinforcement learning domains via deep mixtures-of-experts

  • Michael Gimelfarb
  • Scott Sanner
  • Chi-Guhn Lee

In reinforcement learning, agents that consider the context or current state when transferring source policies have been shown to outperform context-free approaches. However, existing approaches suffer from limitations, including sensitivity to sparse or delayed rewards and estimation errors in values. One important insight is that explicit learned models of the source dynamics, when available, could benefit contextual transfer in such settings. In this paper, we assume a family of tasks with shared sub-goals but different dynamics, and availability of estimated dynamics and policies for source tasks. To deal with possible estimation errors in dynamics, we introduce a novel Bayesian mixture-of-experts for learning state-dependent beliefs over source task dynamics that match the target dynamics using state transitions collected from the target task. The mixture is easy to interpret, is robust to estimation errors in dynamics, and is compatible with most RL algorithms. We incorporate it into standard policy reuse frameworks and demonstrate its effectiveness on benchmarks from OpenAI gym.

AAAI Conference 2021 Conference Paper

Online Class-Incremental Continual Learning with Adversarial Shapley Value

  • Dongsub Shim
  • Zheda Mai
  • Jihwan Jeong
  • Scott Sanner
  • Hyunwoo Kim
  • Jongseong Jang

As image-based deep learning becomes pervasive on every device from cell phones to smart watches, there is a growing need to develop methods that continually learn from data while minimizing memory footprint and power consumption. While memory replay techniques have shown exceptional promise for this task of continual learning, the best method for selecting which buffered images to replay is still an open question. In this paper, we specifically focus on the online class-incremental setting where a model needs to learn new classes continually from an online data stream. To this end, we contribute a novel Adversarial Shapley value scoring method that scores memory data samples according to their ability to preserve latent decision boundaries for previously observed classes (to maintain learning stability and avoid forgetting) while interfering with latent decision boundaries of current classes being learned (to encourage plasticity and optimal learning of new class boundaries). Overall, we observe that our proposed ASER method provides competitive or improved performance compared to state-of-the-art replay-based continual learning methods on a variety of datasets.

NeurIPS Conference 2021 Conference Paper

Representer Point Selection via Local Jacobian Expansion for Post-hoc Classifier Explanation of Deep Neural Networks and Ensemble Models

  • Yi Sui
  • Ga Wu
  • Scott Sanner

Explaining the influence of training data on deep neural network predictions is a critical tool for debugging models through data curation. A recent tractable and appealing approach for this task was provided via the concept of Representer Point Selection (RPS), i.e., a method that leverages the dual form of l2-regularized optimization in the last layer of the neural network to identify the contribution of training points to the prediction. However, two key drawbacks of RPS are that it (i) leads to disagreement between the originally trained network and the RP regularized network modification and (ii) often yields a static ranking of training data for the same class, independent of the data being classified. Inspired by the RPS approach, we propose an alternative method based on a local Jacobian expansion (LJE), a Taylor expansion of the network's Jacobian. We empirically compared RPS-LJE with the original RPS-l2 on image classification (with ResNet), text classification (with Bi-LSTM recurrent neural networks), and tabular classification (with XGBoost) tasks. Quantitatively, we show that RPS-LJE slightly outperforms RPS-l2 and other state-of-the-art data explanation methods by up to 3% on a data debugging task. Qualitatively, we observe that RPS-LJE provides individualized explanations for each test data point rather than the class-specific static ranking of points in the original approach. Overall, RPS-LJE represents a novel approach to RPS that provides a powerful tool for data-oriented explanation and debugging.

NeurIPS Conference 2021 Conference Paper

Risk-Aware Transfer in Reinforcement Learning using Successor Features

  • Michael Gimelfarb
  • Andre Barreto
  • Scott Sanner
  • Chi-Guhn Lee

Sample efficiency and risk-awareness are central to the development of practical reinforcement learning (RL) for complex decision-making. The former can be addressed by transfer learning, while the latter by optimizing some utility function of the return. However, the problem of transferring skills in a risk-aware manner is not well-understood. In this paper, we address the problem of transferring policies between tasks in a common domain that differ only in their reward functions, in which risk is measured by the variance of reward streams. Our approach begins by extending the idea of generalized policy improvement to maximize entropic utilities, thus extending dynamic programming's policy improvement operation to sets of policies and levels of risk-aversion. Next, we extend the idea of successor features (SF), a value function representation that decouples the environment dynamics from the rewards, to capture the variance of returns. Our resulting risk-aware successor features (RaSF) integrate seamlessly within the RL framework, inherit the superior task generalization ability of SFs, and incorporate risk into decision-making. Experiments on a discrete navigation domain and control of a simulated robotic arm demonstrate the ability of RaSFs to outperform alternative methods, including SFs, when taking the risk of the learned policies into account.

PRL Workshop 2021 Workshop Paper

Scalable Risk-Sensitive Planning by Gradient Descent

  • Noah Patton
  • Jihwan Jeong
  • Michael Gimelfarb
  • Scott Sanner

Planning provides a framework for optimizing sequential decisions in potentially complex environments. A recent advance in efficient planning in deterministic high-dimensional domains with continuous action spaces leverages backpropagation through a model of the environment to directly optimize the actions. However, this method does not take risk into account when optimizing decisions in highly stochastic environments. We address this problem by introducing Risk-Aware Planning using PyTorch (RAPTOR), a framework that handles risk in stochastic planning domains through an end-to-end optimization of entropic utility. While we cannot directly formalize the distributionally-defined entropic utility in closed-form for end-to-end planning, in settings where all MDP stochasticity is defined through the location-scale family, we can reparameterize the objective and apply stochastic backpropagation. What is notable in this approach is that the entropic utility is defined based on sufficient statistics computed from forward sampled trajectories, but due to the nature of autodifferentiation, we can still backpropagate through the entropic utility and these sufficient statistics. The resulting sequence of actions, which we call the risk-sensitive straight-line plan, provides a lower bound on the utility of the optimal policy and can be seen as a form of hindsight optimization. We evaluate RAPTOR on two highly stochastic domains, including nonlinear navigation and linear reservoir control, demonstrating the ability to manage risk in complex MDPs.

IJCAI Conference 2021 Conference Paper

Symbolic Dynamic Programming for Continuous State MDPs with Linear Program Transitions

  • Jihwan Jeong
  • Parth Jaggi
  • Scott Sanner

Recent advances in symbolic dynamic programming (SDP) have significantly broadened the class of MDPs for which exact closed-form value functions can be derived. However, no existing solution methods can solve complex discrete and continuous state MDPs where a linear program determines state transitions --- transitions that are often required in problems with underlying constrained flow dynamics arising in problems ranging from traffic signal control to telecommunications bandwidth planning. In this paper, we present a novel SDP solution method for MDPs with LP transitions and continuous piecewise linear dynamics by introducing a novel, fully symbolic argmax operator. On three diverse domains, we show the first automated exact closed-form SDP solution to these challenging problems and the significant advantages of our SDP approach over discretized approximations.

AIJ Journal 2020 Journal Article

Compact and efficient encodings for planning in factored state and action spaces with learned Binarized Neural Network transition models

  • Buser Say
  • Scott Sanner

In this paper, we leverage the efficiency of Binarized Neural Networks (BNNs) to learn complex state transition models of planning domains with discretized factored state and action spaces. In order to directly exploit this transition structure for planning, we present two novel compilations of the learned factored planning problem with BNNs based on reductions to Weighted Partial Maximum Boolean Satisfiability (FD-SAT-Plan+) as well as Binary Linear Programming (FD-BLP-Plan+). Theoretically, we show that our SAT-based Bi-Directional Neuron Activation Encoding is asymptotically the most compact encoding relative to the current literature and supports Unit Propagation (UP) – an important property that facilitates efficiency in SAT solvers. Experimentally, we validate the computational efficiency of our Bi-Directional Neuron Activation Encoding in comparison to an existing neuron activation encoding and demonstrate the ability to learn complex transition models with BNNs. We test the runtime efficiency of both FD-SAT-Plan+ and FD-BLP-Plan+ on the learned factored planning problem showing that FD-SAT-Plan+ scales better with increasing BNN size and complexity. Finally, we present a finite-time incremental constraint generation algorithm based on generalized landmark constraints to improve the planning accuracy of our encodings through simulated or real-world interaction.

JAIR Journal 2020 Journal Article

Scalable Planning with Deep Neural Network Learned Transition Models

  • Ga Wu
  • Buser Say
  • Scott Sanner

In many complex planning problems with factored, continuous state and action spaces such as Reservoir Control, Heating Ventilation and Air Conditioning (HVAC), and Navigation domains, it is difficult to obtain a model of the complex nonlinear dynamics that govern state evolution. However, the ubiquity of modern sensors allows us to collect large quantities of data from each of these complex systems and build accurate, nonlinear deep neural network models of their state transitions. But there remains one major problem for the task of control – how can we plan with deep network learned transition models without resorting to Monte Carlo Tree Search and other black-box transition model techniques that ignore model structure and do not easily extend to continuous domains? In this paper, we introduce two types of planning methods that can leverage deep neural network learned transition models: Hybrid Deep MILP Planner (HD-MILP-Plan) and Tensorflow Planner (TF-Plan). In HD-MILP-Plan, we make the critical observation that the Rectified Linear Unit (ReLU) transfer function for deep networks not only allows faster convergence of model learning, but also permits a direct compilation of the deep network transition model to a Mixed-Integer Linear Program (MILP) encoding. Further, we identify deep network specific optimizations for HD-MILP-Plan that improve performance over a base encoding and show that we can plan optimally with respect to the learned deep networks. In TF-Plan, we take advantage of the efficiency of auto-differentiation tools and GPU-based computation where we encode a subclass of purely continuous planning problems as Recurrent Neural Networks and directly optimize the actions through backpropagation. We compare both planners and show that TF-Plan is able to approximate the optimal plans found by HD-MILP-Plan in less computation time. Hence this article offers two novel planners for continuous state and action domains with learned deep neural net transition models: one optimal method (HD-MILP-Plan) and a scalable alternative for large-scale problems (TF-Plan).

AAAI Conference 2019 Conference Paper

Deep Reactive Policies for Planning in Stochastic Nonlinear Domains

  • Thiago P. Bueno
  • Leliane N. de Barros
  • Denis D. Mauá
  • Scott Sanner

Recent advances in applying deep learning to planning have shown that Deep Reactive Policies (DRPs) can be powerful for fast decision-making in complex environments. However, an important limitation of current DRP-based approaches is either the need for optimal planners to be used as ground truth in a supervised learning setting or the sample complexity of high-variance policy gradient estimators, which are particularly troublesome in continuous state-action domains. In order to overcome those limitations, we introduce a framework for training DRPs in continuous stochastic spaces via gradient-based policy search. The general approach is to explicitly encode a parametric policy as a deep neural network, and to formulate the probabilistic planning problem as an optimization task in a stochastic computation graph by exploiting the re-parameterization of the transition probability densities; the optimization is then solved by leveraging gradient descent algorithms that are able to handle non-convex objective functions. We benchmark our approach against stochastic planning domains exhibiting arbitrary differentiable nonlinear transition and cost functions (e.g., Reservoir Control, HVAC, and Navigation). Results show that DRPs with more than 125,000 continuous action parameters can be optimized by our approach for problems with 30 state fluents and 30 action fluents on inexpensive hardware under 6 minutes. Also, we observed a speedup of 5 orders of magnitude in the average inference time per decision step of DRPs when compared to other state-of-the-art online gradient-based planners when the same level of solution quality is required.

UAI Conference 2019 Conference Paper

Epsilon-BMC: A Bayesian Ensemble Approach to Epsilon-Greedy Exploration in Model-Free Reinforcement Learning

  • Michael Gimelfarb
  • Scott Sanner
  • Chi-Guhn Lee

Resolving the exploration-exploitation trade-off remains a fundamental problem in the design and implementation of reinforcement learning (RL) algorithms. In this paper, we focus on model-free RL using the epsilon-greedy exploration policy, which despite its simplicity, remains one of the most frequently used forms of exploration. However, a key limitation of this policy is the specification of epsilon. We provide a novel Bayesian perspective of epsilon as a measure of the uncertainty (and hence convergence) in the Q-value function. Based on this new perspective, we introduce a closed-form Bayesian model update built on Bayesian model combination (BMC), which allows us to adapt epsilon using experiences from the environment in constant time with monotone convergence guarantees. We demonstrate that our proposed algorithm, epsilon-BMC, efficiently balances exploration and exploitation on different problems, performing comparably to or outperforming the best tuned fixed annealing schedules and an alternative data-dependent epsilon adaptation scheme proposed in the literature.
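For readers unfamiliar with the baseline being adapted, the epsilon-greedy rule itself is a few lines (an illustrative sketch, not the paper's epsilon-BMC update, which replaces the fixed epsilon below with a Bayesian posterior estimate):

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    # With probability epsilon, explore a uniformly random action;
    # otherwise exploit the greedy action argmax_a Q(s, a).
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

epsilon-BMC's contribution is precisely that `epsilon` here need not be hand-tuned or annealed on a fixed schedule, but is instead adapted online from experience.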

IJCAI Conference 2018 Conference Paper

Efficient Symbolic Integration for Probabilistic Inference

  • Samuel Kolb
  • Martin Mladenov
  • Scott Sanner
  • Vaishak Belle
  • Kristian Kersting

Weighted model integration (WMI) extends weighted model counting (WMC) to the integration of functions over mixed discrete-continuous probability spaces. It has shown tremendous promise for solving inference problems in graphical models and probabilistic programs. Yet, state-of-the-art tools for WMI are generally limited either by the range of amenable theories, or in terms of performance. To address both limitations, we propose the use of extended algebraic decision diagrams (XADDs) as a compilation language for WMI. Aside from tackling typical WMI problems, XADDs also enable partial WMI yielding parametrized solutions. To overcome the main roadblock of XADDs -- the computational cost of integration -- we formulate a novel and powerful exact symbolic dynamic programming (SDP) algorithm that seamlessly handles Boolean, integer-valued and real variables, and is able to effectively cache partial computations, unlike its predecessor. Our empirical results demonstrate that these contributions can lead to a significant computational reduction over existing probabilistic inference algorithms.

IJCAI Conference 2018 Conference Paper

Planning in Factored State and Action Spaces with Learned Binarized Neural Network Transition Models

  • Buser Say
  • Scott Sanner

In this paper, we leverage the efficiency of Binarized Neural Networks (BNNs) to learn complex state transition models of planning domains with discretized factored state and action spaces. In order to directly exploit this transition structure for planning, we present two novel compilations of the learned factored planning problem with BNNs based on reductions to Boolean Satisfiability (FD-SAT-Plan) as well as Binary Linear Programming (FD-BLP-Plan). Experimentally, we show the effectiveness of learning complex transition models with BNNs, and test the runtime efficiency of both encodings on the learned factored planning problem. After this initial investigation, we present an incremental constraint generation algorithm based on generalized landmark constraints to improve the planning accuracy of our encodings. Finally, we show how to extend the best performing encoding (FD-BLP-Plan+) beyond goals to handle factored planning problems with rewards.

NeurIPS Conference 2018 Conference Paper

Reinforcement Learning with Multiple Experts: A Bayesian Model Combination Approach

  • Michael Gimelfarb
  • Scott Sanner
  • Chi-Guhn Lee

Potential-based reward shaping is a powerful technique for accelerating convergence of reinforcement learning algorithms. Typically, such information includes an estimate of the optimal value function and is often provided by a human expert or other sources of domain knowledge. However, this information is often biased or inaccurate and can mislead many reinforcement learning algorithms. In this paper, we apply Bayesian Model Combination with multiple experts in a way that learns to trust a good combination of experts as training progresses. This approach is both computationally efficient and general, and is shown numerically to improve convergence across discrete and continuous domains and different reinforcement learning algorithms.
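As context (a standard construction from the reward-shaping literature, not code from the paper), potential-based shaping adds F(s, s') = gamma * Phi(s') - Phi(s) to the environment reward; the function name is hypothetical:

```python
def shaped_reward(r, phi_s, phi_s_next, gamma=0.99):
    # Potential-based shaping: r' = r + gamma * Phi(s') - Phi(s).
    # The shaping terms telescope along any trajectory, so the optimal
    # policy of the original MDP is preserved (Ng et al., 1999).
    return r + gamma * phi_s_next - phi_s
```

The paper's Bayesian Model Combination then learns, online, how much to trust each expert's candidate potential Phi.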

ICAPS Conference 2017 Conference Paper

Analytic Decision Analysis via Symbolic Dynamic Programming for Parameterized Hybrid MDPs

  • Shamin Kinathil
  • Harold Soh
  • Scott Sanner

Decision analysis w.r.t. unknown parameters is a critical task in decision-making under uncertainty. For example, we may need to (i) perform inverse learning of the cost parameters of a multi-objective reward based on observed agent behavior; (ii) perform sensitivity analyses of policies to various parameter settings; or (iii) analyze and optimize policy performance as a function of policy parameters. When such problems have mixed discrete and continuous state and/or action spaces, this leads to parameterized hybrid MDPs (PHMDPs) that are often approximately solved via discretization, sampling, and/or local gradient methods (when optimization is involved). In this paper we combine two recent advances that allow for the first exact solution and optimization of PHMDPs. We first show how each of the aforementioned use cases can be formalized as PHMDPs, which can then be solved via an extension of symbolic dynamic programming (SDP) even when the solution is piecewise nonlinear. Secondly, we can leverage recent advances in non-convex solvers that require symbolic forms of the objective function for non-convex global optimization in (i), (ii), and (iii) using SDP to derive symbolic solutions for each PHMDP formalization. We demonstrate the efficacy and scalability of our optimal analytical framework on nonlinear examples of each of the aforementioned use cases.

RLDM Conference 2017 Conference Abstract

Decision-Making with Non-Markovian Rewards: From LTL to automata-based reward shaping

  • Alberto Camacho
  • Oscar Chen
  • Scott Sanner
  • Sheila McIlraith

In many decision-making settings, reward is acquired in response to some complex behaviour that an agent realizes over time. An autonomous taxi may receive reward for picking up a passenger and subsequently delivering them to their destination. An assistive robot may receive reward for ensuring a person in their care takes their medication once daily soon after eating. Such reward is acquired by an agent in response to following a path – a sequence of states that collectively capture the reward-worthy behaviour. Reward of this sort is referred to as non-Markovian reward because it is predicated on state history rather than current state. Our concern in this paper is with both the specification and effective exploitation of non-Markovian reward in the context of Markov Decision Processes (MDPs). State-of-the-art UCT-based planners struggle with non-Markovian rewards because of their weak guidance and relatively myopic lookahead. Here we specify non-Markovian reward-worthy behaviour in Linear Temporal Logic. We translate these behaviours to corresponding deterministic finite state automata whose accepting conditions signify satisfaction of the reward-worthy behaviour. These automata accepting conditions form the basis of Markovian rewards that can be solved by off-the-shelf MDP planners, while crucially preserving policy optimality guarantees. We then explore the use of reward shaping to automatically transform these automata-based rewards into reshaped rewards that better guide search. We augmented benchmark MDP domains with non-Markovian rewards and evaluated our technique using PROST, a state-of-the-art heuristic and UCT-based MDP planner. Our experiments demonstrate significantly improved performance achieved by the exploitation of our techniques. The work presented here reflects the use of Linear Temporal Logic to specify non-Markovian reward, but our approach will work for any formal language for which there is a corresponding automata representation.
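To make the automata construction concrete (an illustrative toy, not the paper's compilation), a non-Markovian reward such as "observe a and then later b" compiles to a DFA whose accepting state triggers a Markovian reward in the product MDP; the symbols and state names here are hypothetical:

```python
def dfa_reward(trace):
    # DFA for "a occurs, then later b": q0 --a--> q1 --b--> q2 (accepting);
    # any other symbol self-loops. Tracking the DFA state alongside the MDP
    # state makes this reward Markovian in the product state space.
    delta = {('q0', 'a'): 'q1', ('q1', 'b'): 'q2'}
    state = 'q0'
    for sym in trace:
        state = delta.get((state, sym), state)
    return 1.0 if state == 'q2' else 0.0
```

Reward shaping over the automaton's states (e.g., partial credit for reaching q1) is then what gives a UCT-based planner like PROST denser guidance toward acceptance.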

AAAI Conference 2017 Conference Paper

Hindsight Optimization for Hybrid State and Action MDPs

  • Aswin Raghavan
  • Scott Sanner
  • Roni Khardon
  • Prasad Tadepalli
  • Alan Fern

Hybrid (mixed discrete and continuous) state and action Markov Decision Processes (HSA-MDPs) provide an expressive formalism for modeling stochastic and concurrent sequential decision-making problems. Existing solvers for HSA-MDPs are either limited to very restricted transition distributions, require knowledge of domain-specific basis functions to achieve good approximations, or do not scale. We explore a domain-independent approach based on the framework of hindsight optimization (HOP) for HSA-MDPs, which uses an upper bound on the finite-horizon action values for action selection. Our main contribution is a linear time reduction to a Mixed Integer Linear Program (MILP) that encodes the HOP objective, when the dynamics are specified as location-scale probability distributions parametrized by Piecewise Linear (PWL) functions of states and actions. In addition, we show how to use the same machinery to select actions based on a lower-bound generated by straight line plans. Our empirical results show that the HSA-HOP approach effectively scales to high-dimensional problems and outperforms baselines that are capable of scaling to such large hybrid MDPs.

AAAI Conference 2017 Conference Paper

Low-Rank Linear Cold-Start Recommendation from Social Data

  • Suvash Sedhain
  • Aditya Menon
  • Scott Sanner
  • Lexing Xie
  • Darius Braziunas

The cold-start problem involves recommendation of content to new users of a system, for whom there is no historical preference information available. This proves a challenge for collaborative filtering algorithms that inherently rely on such information. Recent work has shown that social metadata, such as users’ friend groups and page likes, can strongly mitigate the problem. However, such approaches either lack an interpretation as optimising some principled objective, involve iterative non-convex optimisation with limited scalability, or require tuning several hyperparameters. In this paper, we first show how three popular cold-start models are special cases of a linear content-based model, with implicit constraints on the weights. Leveraging this insight, we propose LoCo, a new model for cold-start recommendation based on three ingredients: (a) linear regression to learn an optimal weighting of social signals for preferences, (b) a low-rank parametrisation of the weights to overcome the high dimensionality common in social data, and (c) scalable learning of such low-rank weights using randomised SVD. Experiments on four real-world datasets show that LoCo yields significant improvements over state-of-the-art cold-start recommenders that exploit high-dimensional social network metadata.

SoCS Conference 2017 Conference Paper

Non-Markovian Rewards Expressed in LTL: Guiding Search Via Reward Shaping

  • Alberto Camacho
  • Oscar Chen
  • Scott Sanner
  • Sheila A. McIlraith

We propose an approach to solving Markov Decision Processes with non-Markovian rewards specified in Linear Temporal Logic interpreted over finite traces (LTL-f). Our approach integrates automata representations of LTL-f formulae into compiled MDPs that can be solved by off-the-shelf MDP planners, exploiting reward shaping to help guide search. Experiments with state-of-the-art UCT-based MDP planner PROST show automata-based reward shaping to be an effective method to guide search, producing solutions of superior quality, while maintaining policy optimality guarantees.

IJCAI Conference 2017 Conference Paper

Nonlinear Hybrid Planning with Deep Net Learned Transition Models and Mixed-Integer Linear Programming

  • Buser Say
  • Ga Wu
  • Yu Qing Zhou
  • Scott Sanner

In many real-world hybrid (mixed discrete continuous) planning problems such as Reservoir Control, Heating, Ventilation and Air Conditioning (HVAC), and Navigation, it is difficult to obtain a model of the complex nonlinear dynamics that govern state evolution. However, the ubiquity of modern sensors allows us to collect large quantities of data from each of these complex systems and build accurate, nonlinear deep network models of their state transitions. But there remains one major problem for the task of control -- how can we plan with deep network learned transition models without resorting to Monte Carlo Tree Search and other black-box transition model techniques that ignore model structure and do not easily extend to mixed discrete and continuous domains? In this paper, we make the critical observation that the popular Rectified Linear Unit (ReLU) transfer function for deep networks not only allows accurate nonlinear deep net model learning, but also permits a direct compilation of the deep network transition model to a Mixed-Integer Linear Program (MILP) encoding in a planner we call Hybrid Deep MILP Planning (HD-MILP-PLAN). We identify deep net specific optimizations and a simple sparsification method for HD-MILP-PLAN that improve performance over a naive encoding, and show that we are able to plan optimally with respect to the learned deep network.
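
The compilation hinges on the fact that a ReLU admits an exact mixed-integer linear encoding. The textbook big-M version below (the bound `M` and the brute-force enumeration are illustrative, not the paper's exact HD-MILP-PLAN encoding) can be checked directly:

```python
def relu_bigM_feasible(a, M=100.0):
    """Feasible y values of the big-M MILP encoding of y = max(0, a):
         y >= 0,  y >= a,  y <= M*z,  y <= a + M*(1 - z),  z binary.
    For each feasible z the four inequalities collapse to the single
    point max(0, a), so the encoding is exact (assuming |a| <= M)."""
    sols = []
    for z in (0, 1):
        lo = max(0.0, a)                     # from y >= 0 and y >= a
        hi = min(M * z, a + M * (1 - z))     # from the two upper bounds
        if lo <= hi:
            sols.append(lo)
    return sols
```

Stacking one such binary variable per hidden unit, per layer, per time step yields the MILP over which an off-the-shelf solver can plan optimally.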

NeurIPS Conference 2017 Conference Paper

Scalable Planning with Tensorflow for Hybrid Nonlinear Domains

  • Ga Wu
  • Buser Say
  • Scott Sanner

Given recent deep learning results that demonstrate the ability to effectively optimize high-dimensional non-convex functions with gradient descent optimization on GPUs, we ask in this paper whether symbolic gradient optimization tools such as Tensorflow can be effective for planning in hybrid (mixed discrete and continuous) nonlinear domains with high dimensional state and action spaces. To this end, we demonstrate that hybrid planning with Tensorflow and RMSProp gradient descent is competitive with mixed integer linear program (MILP) based optimization on piecewise linear planning domains (where we can compute optimal solutions) and substantially outperforms state-of-the-art interior point methods for nonlinear planning domains. Furthermore, we remark that Tensorflow is highly scalable, converging to a strong plan on a large-scale concurrent domain with a total of 576,000 continuous action parameters distributed over a horizon of 96 time steps and 100 parallel instances in only 4 minutes. We provide a number of insights that clarify such strong performance including observations that despite long horizons, RMSProp avoids both the vanishing and exploding gradient problems. Together these results suggest a new frontier for highly scalable planning in nonlinear hybrid domains by leveraging GPUs and the power of recent advances in gradient descent with highly optimized toolkits like Tensorflow.
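
Planning by backpropagation reduces to gradient descent on the action sequence through an unrolled dynamics model. A deliberately tiny NumPy version (1-D integrator dynamics with a hand-derived gradient; the horizon, goal, and step size are made up for illustration) conveys the idea:

```python
import numpy as np

T, goal, lr = 10, 5.0, 0.01
u = np.zeros(T)                                  # action sequence to optimize
for _ in range(5000):
    s = np.concatenate(([0.0], np.cumsum(u)))    # unroll s_{t+1} = s_t + u_t
    err = 2.0 * (s - goal)                       # d cost / d s_t for cost sum_t (s_t - goal)^2
    # chain rule through the rollout: d cost / d u_k = sum over t > k of err_t
    grad = np.cumsum(err[::-1])[::-1][1:]
    u -= lr * grad                               # plain gradient step (RMSProp in the paper)
s = np.concatenate(([0.0], np.cumsum(u)))        # final rollout under the optimized plan
```

In the paper the same pattern runs at scale: Tensorflow supplies the gradient through the unrolled transitions automatically, and RMSProp replaces the fixed step size used here.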

IJCAI Conference 2016 Conference Paper

A Symbolic Closed-Form Solution to Sequential Market Making with Inventory

  • Shamin Kinathil
  • Scott Sanner
  • Sanmay Das
  • Nicolás Della Penna

Market-makers serve an important role as providers of liquidity and order in financial markets, particularly during periods of high volatility. Optimal market-makers solve a sequential decision making problem, where they face an exploration versus exploitation dilemma at each time step. A belief state MDP based solution was presented by Das and Magdon-Ismail [NIPS, 2008]. This solution, however, was closely tied to the choice of a Gaussian belief state prior and did not take asset inventory into consideration when calculating an optimal policy. In this work we introduce a novel continuous state POMDP framework which is the first to solve, exactly and in closed-form, the optimal market making problem with inventory, fixed asset value, arbitrary belief state priors, trader models and reward functions via symbolic dynamic programming. We use this novel model and solution to show that sequentially optimal policies are heavily inventory-dependent and calculate policies that operate with bounded loss guarantees under a variety of market models and conditions.

AAAI Conference 2016 Conference Paper

Closed-Form Gibbs Sampling for Graphical Models with Algebraic Constraints

  • Hadi Mohasel Afshar
  • Scott Sanner
  • Christfried Webers

Probabilistic inference in many real-world problems requires graphical models with deterministic algebraic constraints between random variables (e.g., Newtonian mechanics, Pascal’s law, Ohm’s law) that are known to be problematic for many inference methods such as Monte Carlo sampling. Fortunately, when such constraints are invertible, the model can be collapsed and the constraints eliminated through the well-known Jacobian-based change of variables. As our first contribution in this work, we show that a much broader class of algebraic constraints can be collapsed by leveraging the properties of a Dirac delta model of deterministic constraints. Unfortunately, the collapsing process can lead to highly piecewise densities that pose challenges for existing probabilistic inference tools. Thus, our second contribution to address these challenges is to present a variation of Gibbs sampling that efficiently samples from these piecewise densities. The key insight to achieve this is to introduce a class of functions that (1) is sufficiently rich to approximate arbitrary models up to arbitrary precision, (2) is closed under dimension reduction (collapsing) for models with (non)linear algebraic constraints and (3) always permits one analytical integral sufficient to automatically derive closed-form conditionals for Gibbs sampling. Experiments demonstrate the proposed sampler converges at least an order of magnitude faster than existing Monte Carlo samplers.
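
The Jacobian-based collapse that the paper generalizes can be sanity-checked numerically on the simplest invertible constraint. The example below (y = exp(x) with standard normal x, Monte Carlo bin estimates; all choices illustrative) verifies the classical change of variables rather than the Dirac-delta machinery itself:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200_000)
y = np.exp(x)                       # invertible constraint y = g(x)

# change of variables: p_Y(y) = p_X(log y) * |d log(y)/dy| = N(log y; 0, 1) / y
ys = np.array([0.5, 1.0, 2.0])
analytic = np.exp(-np.log(ys) ** 2 / 2) / (np.sqrt(2 * np.pi) * ys)

# Monte Carlo density estimate from narrow bins around each query point
eps = 0.02
mc = np.array([((y > v - eps) & (y < v + eps)).mean() / (2 * eps) for v in ys])
```

The collapsed density matches the empirical one; the paper's contribution is making the analogous collapse work for non-invertible constraints, where the result is piecewise and needs the specialized Gibbs sampler.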

AAAI Conference 2016 Conference Paper

On the Effectiveness of Linear Models for One-Class Collaborative Filtering

  • Suvash Sedhain
  • Aditya Menon
  • Scott Sanner
  • Darius Braziunas

In many personalised recommendation problems, there are examples of items users prefer or like, but no examples of items they dislike. A state-of-the-art method for such implicit feedback, or one-class collaborative filtering (OC-CF), problems is SLIM, which makes recommendations based on a learned item-item similarity matrix. While SLIM has been shown to perform well on implicit feedback tasks, we argue that it is hindered by two limitations: first, it does not produce user-personalised predictions, which hampers recommendation performance; second, it involves solving a constrained optimisation problem, which impedes fast training. In this paper, we propose LRec, a variant of SLIM that overcomes these limitations without sacrificing any of SLIM’s strengths. At its core, LRec employs linear logistic regression; despite this simplicity, LRec consistently and significantly outperforms all existing methods on a range of datasets. Our results thus illustrate that the OC-CF problem can be effectively tackled via linear classification models.
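
At its core the approach is per-user logistic regression over the interaction matrix. A toy NumPy rendition (made-up 4x5 matrix, plain gradient descent, and no special handling of the diagonal self-feature, unlike a careful implementation) conveys the shape of it:

```python
import numpy as np

# hypothetical tiny interaction matrix R (users x items); 1 = observed "like"
R = np.array([[1, 1, 0, 0, 1],
              [1, 0, 1, 0, 0],
              [0, 1, 0, 1, 1],
              [1, 1, 1, 0, 0]], dtype=float)

def lrec_scores(R, epochs=500, lr=0.5, reg=0.1):
    n_users, n_items = R.shape
    W = np.zeros((n_users, n_users))          # learned user-user weights
    for u in range(n_users):
        y = R[u]                              # labels: user u's own row
        w = np.zeros(n_users)                 # features of item i: column R[:, i]
        for _ in range(epochs):
            p = 1.0 / (1.0 + np.exp(-(w @ R)))             # per-item predictions
            w -= lr * ((p - y) @ R.T / n_items + reg * w)  # logistic + L2 gradient
        W[u] = w
    return W @ R                              # user-personalised scores

S = lrec_scores(R)
```

Each user gets their own weight vector over users, which is what makes the predictions personalised; the independent per-user problems are also what makes training embarrassingly parallel.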

IJCAI Conference 2016 Conference Paper

Practical Linear Models for Large-Scale One-Class Collaborative Filtering

  • Suvash Sedhain
  • Hung Bui
  • Jaya Kawale
  • Nikos Vlassis
  • Branislav Kveton
  • Aditya Krishna Menon
  • Trung Bui
  • Scott Sanner

Collaborative filtering has emerged as the de facto approach to personalized recommendation problems. However, a scenario that has proven difficult in practice is the one-class collaborative filtering case (OC-CF), where one has examples of items that a user prefers, but no examples of items they do not prefer. In such cases, it is desirable to have recommendation algorithms that are personalized, learning-based, and highly scalable. Existing linear recommenders for OC-CF achieve good performance in benchmarking tasks, but they involve solving a large number of regression subproblems, limiting their applicability to large-scale problems. We show that it is possible to scale up linear recommenders to big data by learning an OC-CF model in a randomized low-dimensional embedding of the user-item interaction matrix. Our algorithm, Linear-FLow, achieves state-of-the-art performance in a comprehensive set of experiments on standard benchmarks as well as real data.

AIJ Journal 2016 Journal Article

Real-time dynamic programming for Markov decision processes with imprecise probabilities

  • Karina V. Delgado
  • Leliane N. de Barros
  • Daniel B. Dias
  • Scott Sanner

Markov Decision Processes have become the standard model for probabilistic planning. However, when applied to many practical problems, the estimates of transition probabilities are inaccurate. This may be due to conflicting elicitations from experts or insufficient state transition information. The Markov Decision Process with Imprecise Transition Probabilities (MDP-IPs) was introduced to obtain a robust policy where there is uncertainty in the transition. Although a symbolic dynamic programming algorithm for MDP-IPs (called SPUDD-IP) has been proposed that can solve problems with up to 22 state variables, in practice, solving MDP-IP problems is time-consuming. In this paper we propose efficient algorithms for a more general class of MDP-IPs, called Stochastic Shortest Path MDP-IPs (SSP MDP-IPs) that use initial state information to solve complex problems by focusing on reachable states. The (L)RTDP-IP algorithm, a (Labeled) Real Time Dynamic Programming algorithm for SSP MDP-IPs, is proposed together with three different methods for sampling the next state. It is shown here that the convergence of (L)RTDP-IP can be obtained by using any of these three methods, although the Bellman backups for this class of problems prescribe a minimax optimization. As far as we are aware, this is the first asynchronous algorithm for SSP MDP-IPs given in terms of a general set of probability constraints that requires non-linear optimization over imprecise probabilities in the Bellman backup. Our results show up to three orders of magnitude speedup for (L)RTDP-IP when compared with the SPUDD-IP algorithm.

AAAI Conference 2015 Conference Paper

Bayesian Model Averaging Naive Bayes (BMA-NB): Averaging over an Exponential Number of Feature Models in Linear Time

  • Ga Wu
  • Scott Sanner
  • Rodrigo Oliveira

Naive Bayes (NB) is well-known to be a simple but effective classifier, especially when combined with feature selection. Unfortunately, feature selection methods are often greedy and thus cannot guarantee an optimal feature set is selected. An alternative to feature selection is to use Bayesian model averaging (BMA), which computes a weighted average over multiple predictors; when the different predictor models correspond to different feature sets, BMA has the advantage over feature selection that its predictions tend to have lower variance on average in comparison to any single model. In this paper, we show for the first time that it is possible to exactly evaluate BMA over the exponentially-sized powerset of NB feature models in linear time in the number of features; this yields an algorithm about as expensive to train as a single NB model with all features, yet provably converges to the globally optimal feature subset in the asymptotic limit of data. We evaluate this novel BMA-NB classifier on a range of datasets showing that it never underperforms NB (as expected) and sometimes offers performance competitive (or superior) to classifiers such as SVMs and logistic regression while taking a fraction of the time to train.
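
The linear-time result rests on an algebraic identity: a sum of products over all 2^|F| feature subsets factorizes into a product of per-feature terms. A brute-force check with hypothetical per-feature weights (standing in for the per-feature likelihood and prior factors of BMA-NB):

```python
import itertools
import math

w = [0.5, 2.0, 0.25, 1.5]     # hypothetical per-feature weights

# brute force: sum over all 2^|F| feature subsets of the product of member weights
brute = sum(
    math.prod(w[i] for i in subset)
    for r in range(len(w) + 1)
    for subset in itertools.combinations(range(len(w)), r)
)

# linear-time factorization: sum over subsets S of prod_{f in S} w_f = prod_f (1 + w_f)
factored = math.prod(1 + wi for wi in w)
```

The left side costs O(2^|F|) while the right side costs O(|F|), which is exactly the exponential-to-linear collapse that makes averaging over every feature subset as cheap as training one model.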

AAAI Conference 2015 Conference Paper

Linear-Time Gibbs Sampling in Piecewise Graphical Models

  • Hadi Afshar
  • Scott Sanner
  • Ehsan Abbasnejad

Many real-world Bayesian inference problems such as preference learning or trader valuation modeling in financial markets naturally use piecewise likelihoods. Unfortunately, exact closed-form inference in the underlying Bayesian graphical models is intractable in the general case and existing approximation techniques provide few guarantees on both approximation quality and efficiency. While (Markov Chain) Monte Carlo methods provide an attractive asymptotically unbiased approximation approach, rejection sampling and Metropolis-Hastings both prove inefficient in practice, and analytical derivation of Gibbs samplers require exponential space and time in the amount of data. In this work, we show how to transform problematic piecewise likelihoods into equivalent mixture models and then provide a blocked Gibbs sampling approach for this transformed model that achieves an exponential-to-linear reduction in space and time compared to a conventional Gibbs sampler. This enables fast, asymptotically unbiased Bayesian inference in a new expressive class of piecewise graphical models and empirically requires orders of magnitude less time than rejection, Metropolis-Hastings, and conventional Gibbs sampling methods to achieve the same level of accuracy.

AAAI Conference 2015 Conference Paper

Loss-Calibrated Monte Carlo Action Selection

  • Ehsan Abbasnejad
  • Justin Domke
  • Scott Sanner

Bayesian decision theory underpins robust decision-making in applications ranging from plant control to robotics where hedging action selection against state uncertainty is critical for minimizing low probability but potentially catastrophic outcomes (e.g., uncontrollable plant conditions or robots falling into stairwells). Unfortunately, belief state distributions in such settings are often complex and/or high dimensional, thus prohibiting the efficient application of analytical techniques for expected utility computation when real-time control is required. This leaves Monte Carlo evaluation as one of the few viable (and hence frequently used) techniques for online action selection. However, loss-insensitive Monte Carlo methods may require large numbers of samples to identify optimal actions with high certainty since they may sample from high probability regions that do not disambiguate action utilities. In this paper we remedy this problem by deriving an optimal proposal distribution for a loss-calibrated Monte Carlo importance sampler that bounds the regret of using an estimated optimal action. Empirically, we show that using our loss-calibrated Monte Carlo method yields high-accuracy optimal action selections in a fraction of the number of samples required by conventional loss-insensitive samplers.
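
The flavor of loss calibration can be shown in a small importance-sampling sketch. The belief, utilities, and the particular proposal (weighting the belief by the utility gap between actions) are invented for illustration and are not the paper's derived optimal proposal:

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.4, 0.3, 0.15, 0.1, 0.05])        # belief over 5 states
U = np.array([[1.0, 1.0, 1.0, 0.0, 5.0],         # utility of action 0 per state
              [1.0, 1.0, 1.0, 4.0, 0.0]])        # utility of action 1 per state

# loss-calibrated proposal: oversample states where the two actions disagree
gap = np.abs(U[0] - U[1])
q = p * (gap + 1e-3)
q /= q.sum()

idx = rng.choice(len(p), size=2000, p=q)
w = p[idx] / q[idx]                               # importance weights
delta = np.mean(w * (U[0, idx] - U[1, idx]))      # unbiased estimate of E_p[U0 - U1]
best = int(delta < 0)                             # action 1 iff the gap is negative
```

Sampling from the belief `p` directly would spend most draws on states where the actions tie and contribute nothing to the decision; weighting by the utility gap concentrates samples where they disambiguate the actions.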

AAAI Conference 2015 Conference Paper

Real-Time Symbolic Dynamic Programming

  • Luis Vianna
  • Leliane de Barros
  • Scott Sanner

Recent advances in Symbolic Dynamic Programming (SDP) combined with the extended algebraic decision diagram (XADD) have provided exact solutions for expressive subclasses of finite-horizon Hybrid Markov Decision Processes (HMDPs) with mixed continuous and discrete state and action parameters. Unfortunately, SDP suffers from two major drawbacks: (1) it solves for all states and can be intractable for many problems that inherently have large optimal XADD value function representations; and (2) it cannot maintain compact (pruned) XADD representations for domains with nonlinear dynamics and reward due to the need for nonlinear constraint checking. In this work, we simultaneously address both of these problems by introducing real-time SDP (RTSDP). RTSDP addresses (1) by focusing the solution and value representation only on regions reachable from a set of initial states and RTSDP addresses (2) by using visited states as witnesses of reachable regions to assist in pruning irrelevant or unreachable (nonlinear) regions of the value function. To this end, RTSDP enjoys provable convergence over the set of initial states and substantial space and time savings over SDP as we demonstrate in a variety of hybrid domains ranging from inventory to reservoir to traffic control.

UAI Conference 2014 Conference Paper

Closed-form Solutions to a Subclass of Continuous Stochastic Games via Symbolic Dynamic Programming

  • Shamin Kinathil
  • Scott Sanner
  • Nicolás Della Penna

Zero-sum stochastic games provide a formalism to study competitive sequential interactions between two agents with diametrically opposing goals and evolving state. A solution to such games with discrete state was presented by Littman (Littman, 1994). The continuous state version of this game remains unsolved. In many instances continuous state solutions require nonlinear optimisation, a problem for which closed-form solutions are generally unavailable. We present an exact closed-form solution to a subclass of zero-sum continuous stochastic games that can be solved as a parameterised linear program by utilising symbolic dynamic programming. This novel technique is applied to calculate exact solutions to a variety of zero-sum continuous state stochastic games.

UAI Conference 2014 Conference Paper

Sequential Bayesian Optimisation for Spatial-Temporal Monitoring

  • Román Marchant
  • Fabio Ramos 0001
  • Scott Sanner

Bayesian Optimisation has received considerable attention in recent years as a general methodology to find the maximum of costly-to-evaluate objective functions. Most existing BO work focuses on where to gather a set of samples without giving special consideration to the sampling sequence, or the costs or constraints associated with that sequence. However, in real-world sequential decision problems such as robotics, the order in which samples are gathered is paramount, especially when the robot needs to optimise a temporally non-stationary objective function. Additionally, the state of the environment and sensing platform determine the type and cost of samples that can be gathered. To address these issues, we formulate Sequential Bayesian Optimisation (SBO) with side-state information within a Partially Observed Markov Decision Process (POMDP) framework that can accommodate discrete and continuous observation spaces. We build on previous work using Monte-Carlo Tree Search (MCTS) and Upper Confidence bound for Trees (UCT) for POMDPs and extend it to work with continuous state and observation spaces. Through a series of experiments on monitoring a spatial-temporal process with a mobile robot, we show that our UCT-based SBO POMDP optimisation outperforms myopic and non-myopic alternatives.

ICML Conference 2013 Conference Paper

Algorithms for Direct 0-1 Loss Optimization in Binary Classification

  • Tan Nguyen
  • Scott Sanner

While convex losses for binary classification are attractive due to the existence of numerous (provably) efficient methods for finding their global optima, they are sensitive to outliers. On the other hand, while the non-convex 0–1 loss is robust to outliers, it is NP-hard to optimize and thus rarely directly optimized in practice. In this paper, however, we do just that: we explore a variety of practical methods for direct (approximate) optimization of the 0–1 loss based on branch and bound search, combinatorial search, and coordinate descent on smooth, differentiable relaxations of 0–1 loss. Empirically, we compare our proposed algorithms to logistic regression, SVM, and the Bayes point machine showing that the proposed 0–1 loss optimization algorithms perform at least as well and offer a clear advantage in the presence of outliers. To this end, we believe this work reiterates the importance of 0–1 loss and its robustness properties while challenging the notion that it is difficult to directly optimize.
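
One of the relaxations the abstract mentions, a sigmoid smoothing of the 0-1 step annealed toward the true loss, fits in a few lines. The toy data, annealing schedule, and step size below are invented for illustration:

```python
import numpy as np

# toy 1-D data; the last point is a gross outlier that drags convex losses around
X = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0], [-8.0]])
y = np.array([-1, -1, -1, 1, 1, 1, 1])

w = np.array([0.1])
for K in (1.0, 2.0, 5.0, 10.0):       # anneal the sigmoid toward the 0-1 step
    for _ in range(200):
        m = y * (X @ w)                           # margins
        s = 1.0 / (1.0 + np.exp(K * m))           # sigmoid(-K m): smooth 0-1 terms
        grad = -(K * s * (1 - s) * y) @ X         # gradient of sum_i sigmoid(-K m_i)
        w -= 0.05 * grad

errors = int((y * (X @ w) <= 0).sum())            # 0-1 loss of the final w
```

On this data a convex logistic fit is dragged by the outlier toward a degenerate solution, while the annealed relaxation settles on a positive weight whose only error is the outlier itself, which is the robustness property the paper argues for.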

UAI Conference 2013 Conference Paper

Bounded Approximate Symbolic Dynamic Programming for Hybrid MDPs

  • Luis Gustavo Vianna
  • Scott Sanner
  • Leliane Nunes de Barros

Recent advances in symbolic dynamic programming (SDP) combined with the extended algebraic decision diagram (XADD) data structure have provided exact solutions for mixed discrete and continuous (hybrid) MDPs with piecewise linear dynamics and continuous actions. Since XADD-based exact solutions may grow intractably large for many problems, we propose a bounded error compression technique for XADDs that involves the solution of a constrained bilinear saddle point problem. Fortuitously, we show that given the special structure of this problem, it can be expressed as a bilevel linear programming problem and solved to optimality in finite time via constraint generation, despite having an infinite set of constraints. This solution permits the use of efficient linear program solvers for XADD compression and enables a novel class of bounded approximate SDP algorithms for hybrid MDPs that empirically offers order-of-magnitude speedups over the exact solution in exchange for a small approximation error.

IJCAI Conference 2013 Conference Paper

Learning Community-Based Preferences via Dirichlet Process Mixtures of Gaussian Processes

  • Ehsan Abbasnejad
  • Scott Sanner
  • Edwin V. Bonilla
  • Pascal Poupart

Bayesian approaches to preference learning using Gaussian Processes (GPs) are attractive due to their ability to explicitly model uncertainty in users’ latent utility functions; unfortunately existing techniques have cubic time complexity in the number of users, which renders this approach intractable for collaborative preference learning over a large user base. Exploiting the observation that user populations often decompose into communities of shared preferences, we model user preferences as an infinite Dirichlet Process (DP) mixture of communities and learn (a) the expected number of preference communities represented in the data, (b) a GP-based preference model over items tailored to each community, and (c) the mixture weights representing each user’s fraction of community membership. This results in a learning and inference process that scales linearly in the number of users rather than cubically and additionally provides the ability to analyze individual community preferences and their associated members. We evaluate our approach on a variety of preference data sources including Amazon Mechanical Turk showing that our method is more scalable and as accurate as previous GP-based preference learning work.

EWRL Workshop 2013 Workshop Paper

Recent Advances in Symbolic Dynamic Programming for Hybrid MDPs and POMDPs

  • Scott Sanner
  • Zahra Zamani

Many real-world decision-theoretic planning problems are naturally modeled using mixed discrete and continuous state, action, and observation spaces, yet little work has provided exact methods for performing exact dynamic programming backups in such problems. This overview talk will survey a number of recent developments in the exact and approximate solution of mixed discrete and continuous (hybrid) MDPs and POMDPs via the technique of symbolic dynamic programming (SDP).

IJCAI Conference 2013 Conference Paper

Robust Optimization for Hybrid MDPs with State-Dependent Noise

  • Zahra Zamani
  • Scott Sanner
  • Karina Valdivia Delgado
  • Leliane Nunes de Barros

Recent advances in solutions to Hybrid MDPs with discrete and continuous state and action spaces have significantly extended the class of MDPs for which exact solutions can be derived, albeit at the expense of a restricted transition noise model. In this paper, we work around limitations of previous solutions by adopting a robust optimization approach in which Nature is allowed to adversarially determine transition noise within pre-specified confidence intervals. This allows one to derive an optimal policy with an arbitrary (user-specified) level of success probability and significantly extends the class of transition noise models for which Hybrid MDPs can be solved. This work also significantly extends results for the related “chance-constrained” approach in stochastic hybrid control to accommodate state-dependent noise. We demonstrate our approach working on a variety of hybrid MDPs taken from AI planning, operations research, and control theory, noting that this is the first time robust solutions with strong guarantees over all states have been automatically derived for such problems.

AAAI Conference 2012 Conference Paper

Symbolic Dynamic Programming for Continuous State and Action MDPs

  • Zahra Zamani
  • Scott Sanner
  • Cheng Fang

Many real-world decision-theoretic planning problems are naturally modeled using both continuous state and action (CSA) spaces, yet little work has provided exact solutions for the case of continuous actions. In this work, we propose a symbolic dynamic programming (SDP) solution to obtain the optimal closed-form value function and policy for CSA-MDPs with multivariate continuous state and actions, discrete noise, piecewise linear dynamics, and piecewise linear (or restricted piecewise quadratic) reward. Our key contribution over previous SDP work is to show how the continuous action maximization step in the dynamic programming backup can be evaluated optimally and symbolically — a task which amounts to symbolic constrained optimization subject to unknown state parameters; we further integrate this technique to work with an efficient and compact data structure for SDP — the extended algebraic decision diagram (XADD). We demonstrate empirical results on a didactic nonlinear planning example and two domains from operations research to show the first automated exact solution to these problems.

NeurIPS Conference 2012 Conference Paper

Symbolic Dynamic Programming for Continuous State and Observation POMDPs

  • Zahra Zamani
  • Scott Sanner
  • Pascal Poupart
  • Kristian Kersting

Partially-observable Markov decision processes (POMDPs) provide a powerful model for real-world sequential decision-making problems. In recent years, point-based value iteration methods have proven to be extremely effective techniques for finding (approximately) optimal dynamic programming solutions to POMDPs when an initial set of belief states is known. However, no point-based work has provided exact point-based backups for both continuous state and observation spaces, which we tackle in this paper. Our key insight is that while there may be an infinite number of possible observations, there are only a finite number of observation partitionings that are relevant for optimal decision-making when a finite, fixed set of reachable belief states is known. To this end, we make two important contributions: (1) we show how previous exact symbolic dynamic programming solutions for continuous state MDPs can be generalized to continuous state POMDPs with discrete observations, and (2) we show how this solution can be further extended via recently developed symbolic methods to continuous state and observations to derive the minimal relevant observation partitioning for potentially correlated, multivariate observation spaces. We demonstrate proof-of-concept results on uni- and multi-variate state and observation steam plant control.

AAAI Conference 2012 Conference Paper

Symbolic Variable Elimination for Discrete and Continuous Graphical Models

  • Scott Sanner
  • Ehsan Abbasnejad

Probabilistic reasoning in the real-world often requires inference in continuous variable graphical models, yet there are few methods for exact, closed-form inference when joint distributions are non-Gaussian. To address this inferential deficit, we introduce SVE – a symbolic extension of the well-known variable elimination algorithm to perform exact inference in an expressive class of mixed discrete and continuous variable graphical models whose conditional probability functions can be well-approximated as oblique piecewise polynomials with bounded support. Using this representation, we show that we can compute all of the SVE operations exactly and in closed-form, which crucially includes definite integration w.r.t. multivariate piecewise polynomial functions. To aid in the efficient computation and compact representation of this solution, we use an extended algebraic decision diagram (XADD) data structure that supports all SVE operations. We provide illustrative results for SVE on probabilistic inference queries inspired by robotics localization and tracking applications that mix various continuous distributions; this represents the first time a general closed-form exact solution has been proposed for this expressive class of discrete/continuous graphical models.

AIJ Journal 2011 Journal Article

Efficient solutions to factored MDPs with imprecise transition probabilities

  • Karina Valdivia Delgado
  • Scott Sanner
  • Leliane Nunes de Barros

When modeling real-world decision-theoretic planning problems in the Markov Decision Process (MDP) framework, it is often impossible to obtain a completely accurate estimate of transition probabilities. For example, natural uncertainty arises in the transition specification due to elicitation of MDP transition models from an expert or estimation from data, or non-stationary transition distributions arising from insufficient state knowledge. In the interest of obtaining the most robust policy under transition uncertainty, the Markov Decision Process with Imprecise Transition Probabilities (MDP-IPs) has been introduced to model such scenarios. Unfortunately, while various solution algorithms exist for MDP-IPs, they often require external calls to optimization routines and thus can be extremely time-consuming in practice. To address this deficiency, we introduce the factored MDP-IP and propose efficient dynamic programming methods to exploit its structure. Noting that the key computational bottleneck in the solution of factored MDP-IPs is the need to repeatedly solve nonlinear constrained optimization problems, we show how to target approximation techniques to drastically reduce the computational overhead of the nonlinear solver while producing bounded, approximately optimal solutions. Our results show up to two orders of magnitude speedup in comparison to traditional “flat” dynamic programming approaches and up to an order of magnitude speedup over the extension of factored MDP approximate value iteration techniques to MDP-IPs while producing the lowest error of any approximation algorithm evaluated.

IJCAI Conference 2011 Conference Paper

Multi-Evidence Lifted Message Passing, with Application to PageRank and the Kalman Filter

  • Babak Ahmadi
  • Kristian Kersting
  • Scott Sanner

Lifted message passing algorithms exploit repeated structure within a given graphical model to answer queries efficiently. Given evidence, they construct a lifted network of supernodes and superpotentials corresponding to sets of nodes and potentials that are indistinguishable given the evidence. Recently, efficient algorithms were presented for updating the structure of an existing lifted network with incremental changes to the evidence. In the inference stage, however, current algorithms need to construct a separate lifted network for each evidence case and run a modified message passing algorithm on each lifted network separately. Consequently, symmetries across the inference tasks are not exploited. In this paper, we present a novel lifted message passing technique that exploits symmetries across multiple evidence cases. The benefits of this multi-evidence lifted inference are shown for several important AI tasks such as computing personalized PageRanks and Kalman filters via multi-evidence lifted Gaussian belief propagation.

AAMAS Conference 2010 Conference Paper

Approximate Dynamic Programming with Affine ADDs

  • Scott Sanner
  • William Uther
  • Karina Valdivia Delgado

The Affine ADD (AADD) is an extension of the Algebraic Decision Diagram (ADD) that compactly represents context-specific, additive and multiplicative structure in functions from a discrete domain to a real-valued range. In this paper, we introduce a novel algorithm for efficiently finding AADD approximations that we use to develop the MADCAP algorithm for AADD-based structured approximate dynamic programming (ADP) with factored MDPs. MADCAP requires less time and space to achieve comparable or better approximate solutions than the current state-of-the-art ADD-based ADP algorithm, APRICODD, and can provide approximate solutions for problems with context-specific, additive and multiplicative structure on which APRICODD runs out of memory.

NeurIPS Conference 2010 Conference Paper

Gaussian Process Preference Elicitation

  • Shengbo Guo
  • Scott Sanner
  • Edwin Bonilla

Bayesian approaches to preference elicitation (PE) are particularly attractive due to their ability to explicitly model uncertainty in users' latent utility functions. However, previous approaches to Bayesian PE have ignored the important problem of generalizing from previous users to an unseen user in order to reduce the elicitation burden on new users. In this paper, we address this deficiency by introducing a Gaussian Process (GP) prior over users' latent utility functions on the joint space of user and item features. We learn the hyper-parameters of this GP on a set of preferences of previous users and use it to aid in the elicitation process for a new user. This approach provides a flexible model of a multi-user utility function, facilitates an efficient value of information (VOI) heuristic query selection strategy, and provides a principled way to incorporate the elicitations of multiple users back into the model. We show the effectiveness of our method in comparison to previous work on a real dataset of user preferences over sushi types.
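The core modeling move, a single GP prior over joint user-item features so that past users' preferences inform predictions for a new user, can be sketched in a few lines of numpy (an illustrative toy, not the paper's elicitation procedure; the features, kernel, and length scale are assumptions):

```python
# One RBF kernel on concatenated [user, item] features gives a multi-user
# utility model; the GP posterior mean predicts a new user's utilities
# from other users' observed preferences.
import numpy as np

def rbf(A, B, length_scale=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length_scale**2)

# Joint features: first column a 1-d user feature, second a 1-d item feature.
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y_train = np.array([1.0, -1.0, 1.0, -1.0])   # utilities of two past users

# Query: a new user (feature 0.5) on both items.
X_test = np.array([[0.5, 0.0], [0.5, 1.0]])

K = rbf(X_train, X_train) + 1e-6 * np.eye(4)  # jitter for stability
mu = rbf(X_test, X_train) @ np.linalg.solve(K, y_train)
print(mu)  # mu[0] > mu[1]: item 0 is preferred, matching both past users
```

The same posterior also yields predictive variances, which is what a value-of-information query-selection heuristic would consume.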

AAAI Conference 2010 Conference Paper

Symbolic Dynamic Programming for First-order POMDPs

  • Scott Sanner
  • Kristian Kersting

Partially-observable Markov decision processes (POMDPs) provide a powerful model for sequential decision-making problems with partially-observed state and are known to have (approximately) optimal dynamic programming solutions. Much work in recent years has focused on improving the efficiency of these dynamic programming algorithms by exploiting symmetries and factored or relational representations. In this work, we show that it is also possible to exploit the full expressive power of first-order quantification to achieve state, action, and observation abstraction in a dynamic programming solution to relationally specified POMDPs. Among the advantages of this approach are the ability to maintain compact value function representations, abstract over the space of potentially optimal actions, and automatically derive compact conditional policy trees that minimally partition relational observation spaces according to distinctions that have an impact on policy values. This is the first lifted relational POMDP solution that can optimally accommodate actions with a potentially infinite relational space of observation outcomes.

IJCAI Conference 2009 Conference Paper

Bayesian Real-Time Dynamic Programming

  • Scott Sanner
  • Robby Goetschalckx
  • Kurt Driessens
  • Guy Shani

Real-time dynamic programming (RTDP) solves Markov decision processes (MDPs) when the initial state is restricted, by focusing dynamic programming on the envelope of states reachable from an initial state set. RTDP often provides performance guarantees without visiting the entire state space. Building on RTDP, recent work has sought to improve its efficiency through various optimizations, including maintaining upper and lower bounds to both govern trial termination and prioritize state exploration. In this work, we take a Bayesian perspective on these upper and lower bounds and use a value of perfect information (VPI) analysis to govern trial termination and exploration in a novel algorithm we call VPI-RTDP. VPI-RTDP leads to an improvement over state-of-the-art RTDP methods, empirically yielding up to a three-fold reduction in the amount of time and number of visited states required to achieve comparable policy performance.
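The upper/lower-bound trial scheme that VPI-RTDP builds on can be sketched on a toy chain MDP (this is the generic bounded-RTDP loop with a gap-based stopping rule, not the paper's Bayesian VPI criterion; the MDP and thresholds are assumptions):

```python
# States 0..3; the single action reaches s+1 w.p. 0.8, stays w.p. 0.2;
# each step costs 1; state 3 is an absorbing goal.
import random

N, GOAL, P = 4, 3, 0.8

def succ(s):                      # (prob, next_state) pairs for the one action
    return [(P, min(s + 1, GOAL)), (1 - P, s)]

def backup(V, s):                 # Bellman backup for cost-to-go
    return 0.0 if s == GOAL else 1.0 + sum(p * V[t] for p, t in succ(s))

random.seed(0)
lo = [0.0] * N                    # admissible lower bound
hi = [50.0] * N; hi[GOAL] = 0.0   # crude upper bound
for _ in range(200):              # trials from the initial state
    s = 0
    # Terminate a trial when the bound gap at the current state is small.
    while s != GOAL and hi[s] - lo[s] > 1e-4:
        hi[s], lo[s] = backup(hi, s), backup(lo, s)
        r = random.random()       # sample the next state
        s = succ(s)[0][1] if r < P else succ(s)[1][1]

# Each transition advances with prob 0.8, so V*(0) = 3 / 0.8 = 3.75;
# lo and hi bracket this value and meet within the 1e-4 gap.
print(round(lo[0], 3), round(hi[0], 3))
```

The VPI idea in the paper replaces the fixed gap threshold with an estimate of how much further information at a state is actually worth.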

ICAPS Conference 2009 Conference Paper

Efficient Solutions to Factored MDPs with Imprecise Transition Probabilities

  • Karina Valdivia Delgado
  • Scott Sanner
  • Leliane Nunes de Barros
  • Fábio Gagliardi Cozman

When modeling real-world decision-theoretic planning problems in the Markov decision process (MDP) framework, it is often impossible to obtain a completely accurate estimate of transition probabilities. For example, natural uncertainty arises in the transition specification due to elicitation of MDP transition models from an expert or data, or non-stationary transition distributions arising from insufficient state knowledge. In the interest of obtaining the most robust policy under transition uncertainty, the Markov Decision Process with Imprecise Transition Probabilities (MDP-IP) has been introduced to model such scenarios. Unfortunately, while solutions to the MDP-IP are well-known, they require nonlinear optimization and are extremely time-consuming in practice. To address this deficiency, we propose efficient dynamic programming methods to exploit the structure of factored MDP-IPs. Noting that the key computational bottleneck in the solution of MDP-IPs is the need to repeatedly solve nonlinear constrained optimization problems, we show how to target approximation techniques to drastically reduce the computational overhead of the nonlinear solver while producing bounded, approximately optimal solutions. Our results show up to two orders of magnitude speedup in comparison to traditional “flat” dynamic programming approaches and up to an order of magnitude speedup over the extension of factored MDP approximate value iteration techniques to MDP-IPs.
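The max-min flavor of MDP-IP solutions can be sketched with robust value iteration over interval transition probabilities (a simple flat interval model for illustration, not the paper's factored MDP-IP machinery; the tiny MDP is an assumption):

```python
# Nature adversarially picks the worst distribution inside the intervals
# at every backup, yielding the robust (max-min) value function.
def worst_dist(intervals, values):
    """Pick p in [lo, hi] per successor, summing to 1, maximizing sum(p*values)."""
    lo = {s: l for s, (l, h) in intervals.items()}
    mass = 1.0 - sum(lo.values())            # mass left after lower bounds
    p = dict(lo)
    for s in sorted(intervals, key=lambda s: -values[s]):   # worst first
        give = min(intervals[s][1] - lo[s], mass)
        p[s] += give
        mass -= give
    return p

# Two states: 0 (cost 1 per step), 1 (absorbing goal, cost 0).
# One action from 0: P(goal) in [0.5, 0.9], P(stay) in [0.1, 0.5].
intervals = {1: (0.5, 0.9), 0: (0.1, 0.5)}
V = {0: 0.0, 1: 0.0}
for _ in range(100):                          # robust Bellman iteration
    p = worst_dist(intervals, V)
    V[0] = 1.0 + sum(p[s] * V[s] for s in p)

# Nature maximizes expected cost, so it picks P(stay) = 0.5:
# V(0) = 1 + 0.5 * V(0)  =>  V(0) = 2.
print(round(V[0], 4))  # 2.0
```

The paper's contribution is avoiding this inner optimization from scratch at every backup by exploiting factored structure and targeted approximation.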

AIJ Journal 2009 Journal Article

Practical solution techniques for first-order MDPs

  • Scott Sanner
  • Craig Boutilier

Many traditional solution approaches to relationally specified decision-theoretic planning problems (e.g., those stated in the probabilistic planning domain description language, or PPDDL) ground the specification with respect to a specific instantiation of domain objects and apply a solution approach directly to the resulting ground Markov decision process (MDP). Unfortunately, the space and time complexity of these grounded solution approaches are polynomial in the number of domain objects and exponential in the predicate arity and the number of nested quantifiers in the relational problem specification. An alternative to grounding a relational planning problem is to tackle the problem directly at the relational level. In this article, we propose one such approach that translates an expressive subset of the PPDDL representation to a first-order MDP (FOMDP) specification and then derives a domain-independent policy without grounding at any intermediate step. However, such generality does not come without its own set of challenges—the purpose of this article is to explore practical solution techniques for solving FOMDPs. To demonstrate the applicability of our techniques, we present proof-of-concept results of our first-order approximate linear programming (FOALP) planner on problems from the probabilistic track of the ICAPS 2004 and 2006 International Planning Competitions.

ECAI Conference 2008 Conference Paper

Reinforcement Learning with the Use of Costly Features

  • Robby Goetschalckx
  • Scott Sanner
  • Kurt Driessens

A common solution approach to reinforcement learning problems with large state spaces (where value functions cannot be represented exactly) is to compute an approximation of the value function in terms of state features. However, little attention has been paid to the cost of computing these state features (e.g., search-based features). To this end, we introduce a cost-sensitive sparse linear-value function approximation algorithm - FOVEA - and demonstrate its performance on an experimental domain with a range of feature costs.

EWRL Workshop 2008 Conference Paper

Reinforcement Learning with the Use of Costly Features

  • Robby Goetschalckx
  • Scott Sanner
  • Kurt Driessens

In many practical reinforcement learning problems, the state space is too large to permit an exact representation of the value function, much less the time required to compute it. In such cases, a common solution approach is to compute an approximation of the value function in terms of state features. However, relatively little attention has been paid to the cost of computing these state features. For example, search-based features may be useful for value prediction, but their computational cost must be traded off with their impact on value accuracy. To this end, we introduce a new cost-sensitive sparse linear regression paradigm for value function approximation in reinforcement learning where the learner is able to select only those costly features that are sufficiently informative to justify their computation. We illustrate the learning behavior of our approach using a simple experimental domain that allows us to explore the effects of a range of costs on the cost-performance trade-off.
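The cost-performance trade-off can be sketched with greedy forward selection that admits a feature only when its error reduction outweighs its cost (illustrative only; this is not FOVEA's actual regularizer, and the data, costs, and trade-off weight are assumptions):

```python
# Add a feature to a linear value approximator only if its reduction in
# squared error exceeds its (scaled) computation cost.
import numpy as np

def fit_sse(X, y, feats):
    """Least-squares residual sum of squares using only the chosen features."""
    if not feats:
        return float((y ** 2).sum())
    A = X[:, feats]
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(((y - A @ w) ** 2).sum())

def select(X, y, costs, trade_off=1.0):
    chosen = []
    while True:
        base = fit_sse(X, y, chosen)
        best, gain = None, 0.0
        for j in range(X.shape[1]):
            if j in chosen:
                continue
            # Net gain = error reduction minus cost of computing feature j.
            g = base - fit_sse(X, y, chosen + [j]) - trade_off * costs[j]
            if g > gain:
                best, gain = j, g
        if best is None:
            return chosen
        chosen.append(best)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3 * X[:, 0] + 0.01 * X[:, 2]   # feature 1 is pure noise
costs = [0.1, 0.1, 5.0]            # feature 2 is informative but costly
print(select(X, y, costs))  # [0]: the cheap, informative feature alone
```

Feature 2 does carry signal, but at cost 5.0 its tiny error reduction never justifies computing it, which is exactly the behavior the paradigm is after.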

ICAPS Conference 2007 Conference Paper

Approximate Solution Techniques for Factored First-Order MDPs

  • Scott Sanner
  • Craig Boutilier

Most traditional approaches to probabilistic planning in relationally specified MDPs rely on grounding the problem w.r.t. specific domain instantiations, thereby incurring a combinatorial blowup in the representation. An alternative approach is to lift a relational MDP to a first-order MDP (FOMDP) specification and develop solution approaches that avoid grounding. Unfortunately, state-of-the-art FOMDPs are inadequate for specifying factored transition models or additive rewards that scale with the domain size -- structure that is very natural in probabilistic planning problems. To remedy these deficiencies, we propose an extension of the FOMDP formalism known as a factored FOMDP and present generalizations of symbolic dynamic programming and linear-value approximation solutions to exploit its structure. Along the way, we also make contributions to the field of first-order probabilistic inference (FOPI) by demonstrating novel first-order structures that can be exploited without domain grounding. We present empirical results to demonstrate that we can obtain solutions whose complexity scales polynomially in the logarithm of the domain size -- results that are impossible to obtain with any previously proposed solution method.

KR Conference 2006 Conference Paper

An Ordered Theory Resolution Calculus for Hybrid Reasoning in First-order Extensions of DLs

  • Scott Sanner
  • Sheila McIlraith

Systems for hybrid reasoning with first-order logic (FOL) extensions of description logic (DL) date back at least 20 years and are enjoying a renewed interest in the context of recent FOL extensions of OWL DL for the Semantic Web. However, current systems for reasoning with such languages can only handle subsets of FOL or they do not fully exploit recent advances in both FOL theorem proving and DL inference. In response, we present an ordered theory resolution calculus for hybrid reasoning in unrestricted FOL extensions of the DL SHI. This calculus permits near-seamless integration of highly optimized FOL theorem provers and DL reasoners while minimizing redundant reasoning and maintaining soundness and refutational completeness. Empirical results demonstrate the potential of this approach in comparison to the state-of-the-art FOL theorem provers Vampire, Otter, and SPASS.

UAI Conference 2006 Conference Paper

Practical Linear Value-approximation Techniques for First-order MDPs

  • Scott Sanner
  • Craig Boutilier

Recent work on approximate linear programming (ALP) techniques for first-order Markov Decision Processes (FOMDPs) represents the value function linearly w.r.t. a set of first-order basis functions and uses linear programming techniques to determine suitable weights. This approach offers the advantage that it does not require simplification of the first-order value function, and allows one to solve FOMDPs independent of a specific domain instantiation. In this paper, we address several questions to enhance the applicability of this work: (1) Can we extend the first-order ALP framework to approximate policy iteration to address performance deficiencies of previous approaches? (2) Can we automatically generate basis functions and evaluate their impact on value function quality? (3) How can we decompose intractable problems with universally quantified rewards into tractable subproblems? We propose answers to these questions along with a number of novel optimizations and provide a comparative empirical evaluation on logistics problems from the ICAPS 2004 Probabilistic Planning Competition.

IJCAI Conference 2005 Conference Paper

Affine Algebraic Decision Diagrams (AADDs) and their Application to Structured Probabilistic Inference

  • Scott Sanner
  • David

We propose an affine extension to ADDs (AADD) capable of compactly representing context-specific, additive, and multiplicative structure. We show that the AADD has worst-case time and space performance within a multiplicative constant of that of ADDs, but that it can be linear in the number of variables in cases where ADDs are exponential in the number of variables. We provide an empirical comparison of tabular, ADD, and AADD representations used in standard Bayes net and MDP inference algorithms and conclude that the AADD performs at least as well as the other two representations, and often yields an exponential performance improvement over both when additive or multiplicative structure can be exploited. These results suggest that the AADD is likely to yield exponential time and space improvements for a variety of probabilistic inference algorithms that currently use tables or ADDs.
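The affine-edge idea behind the exponential saving can be sketched with a tiny evaluator (an illustrative data structure, not the paper's full AADD package; node layout and names are assumptions): each edge carries an affine transform (c, b) applied to its child's value, so the additive function f(x0, x1, x2) = x0 + 2*x1 + 4*x2 needs one node per variable instead of the 8 leaves a plain ADD would use.

```python
# Node = (var, low_edge, high_edge); edge = (c, b, child); terminal = None
# with value 0. Both edges of a node share the same child, which is what
# makes the representation linear in the number of variables.
def make_chain(weights):
    """Build a node chain for f(x) = sum_i weights[i] * x_i."""
    node = None
    for i in reversed(range(len(weights))):
        node = (i, (1.0, 0.0, node), (1.0, weights[i], node))
    return node

def evaluate(node, x):
    if node is None:
        return 0.0
    var, low, high = node
    c, b, child = high if x[var] else low
    return b + c * evaluate(child, x)

f = make_chain([1, 2, 4])
print(evaluate(f, [1, 0, 1]))  # 1*1 + 2*0 + 4*1 = 5.0
```

Multiplicative structure works the same way with the `c` components; a plain ADD would need a distinct leaf for each of the 2^n function values here.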

UAI Conference 2005 Conference Paper

Approximate Linear Programming for First-order MDPs

  • Scott Sanner
  • Craig Boutilier

We introduce a new approximate solution technique for first-order Markov decision processes (FOMDPs). Representing the value function linearly w.r.t. a set of first-order basis functions, we compute suitable weights by casting the corresponding optimization as a first-order linear program and show how off-the-shelf theorem prover and LP software can be effectively used. This technique allows one to solve FOMDPs independent of a specific domain instantiation; furthermore, it allows one to determine bounds on approximation error that apply equally to all domain instantiations. We apply this solution technique to the task of elevator scheduling with a rich feature space and multi-criteria additive reward, and demonstrate that it outperforms a number of intuitive, heuristically-guided policies.

IROS Conference 2002 Conference Paper

Towards object mapping in non-stationary environments with mobile robots

  • Rahul Biswas
  • Benson Limketkai
  • Scott Sanner
  • Sebastian Thrun

We propose an occupancy grid mapping algorithm for mobile robots operating in environments where objects change their locations over time. Our approach uses a straightforward map differencing technique to detect changes in an environment over time. It employs the expectation maximization algorithm to learn models of non-stationary objects, and to determine the location of such objects in individual occupancy grid maps built at different points in time. By combining data from multiple maps when learning object models, the resulting models have higher fidelity than could be obtained from any single map. A Bayesian complexity measure is applied to determine the number of different objects in the model, making it possible to apply the approach to situations where not all objects are present at all times in the map.
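The map-differencing step can be sketched in a few lines of numpy (the EM-based object-model learning is beyond this snippet; the grids and threshold are illustrative assumptions):

```python
# Cells whose occupancy changed between two grid snapshots are flagged as
# candidate cells of non-stationary objects.
import numpy as np

def changed_cells(grid_t0, grid_t1, threshold=0.5):
    """Return a boolean mask of cells whose occupancy flipped."""
    return np.abs(grid_t1 - grid_t0) > threshold

# Two 4x4 occupancy grids (1 = occupied): a 1x2 object moved one cell right.
t0 = np.zeros((4, 4)); t0[1, 0:2] = 1.0
t1 = np.zeros((4, 4)); t1[1, 1:3] = 1.0
mask = changed_cells(t0, t1)
print(np.argwhere(mask).tolist())  # [[1, 0], [1, 2]]: vacated and newly occupied
```

The overlapping cell (1, 1) stays unflagged, which is why pooling several snapshots via EM is needed to segment whole objects rather than just changed cells.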