Author name cluster

Georgios Theocharous

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

26 papers
2 author rows

Possible papers (26)

AAAI Conference 2024 · Conference Paper

Distributional Off-Policy Evaluation for Slate Recommendations

  • Shreyas Chaudhari
  • David Arbour
  • Georgios Theocharous
  • Nikos Vlassis

Recommendation strategies are typically evaluated by using previously logged data, employing off-policy evaluation methods to estimate their expected performance. However, for strategies that present users with slates of multiple items, the resulting combinatorial action space renders many of these methods impractical. Prior work has developed estimators that leverage the structure in slates to estimate the expected off-policy performance, but the estimation of the entire performance distribution remains elusive. Estimating the complete distribution allows for a more comprehensive evaluation of recommendation strategies, particularly along the axes of risk and fairness that employ metrics computable from the distribution. In this paper, we propose an estimator for the complete off-policy performance distribution for slates and establish conditions under which the estimator is unbiased and consistent. This builds upon prior work on off-policy evaluation for slates and off-policy distribution estimation in reinforcement learning. We validate the efficacy of our method empirically on synthetic data as well as on a slate recommendation simulator constructed from real-world data (MovieLens-20M). Our results show a significant reduction in estimation variance and improved sample efficiency over prior work across a range of slate structures.
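
A minimal sketch of the underlying idea, estimating a full reward distribution off-policy via self-normalized importance sampling. The propensities, synthetic data, and the `off_policy_cdf` helper are illustrative assumptions; the paper's estimator instead exploits slate structure to reduce variance rather than weighting whole slates.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
rewards = rng.normal(loc=1.0, scale=0.5, size=n)      # logged slate rewards
p_logging = np.full(n, 0.10)                          # logging propensity per slate
p_target = np.clip(p_logging * rng.uniform(0.5, 2.0, size=n), 1e-3, 1.0)

w = p_target / p_logging                              # full-slate importance weights

def off_policy_cdf(t: float) -> float:
    """Self-normalized IS estimate of P(reward <= t) under the target policy."""
    return float(np.sum(w * (rewards <= t)) / np.sum(w))

# Once the CDF is estimated, risk and fairness metrics (quantiles, CVaR)
# can be computed from it, e.g. by bisection over off_policy_cdf.
print(off_policy_cdf(1.0))
```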

ICLR Conference 2023 · Conference Paper

Explaining RL Decisions with Trajectories

  • Shripad Vilasrao Deshmukh
  • Arpan Dasgupta
  • Balaji Krishnamurthy
  • Nan Jiang
  • Chirag Agarwal
  • Georgios Theocharous
  • Jayakumar Subramanian

Explanation is a key component for the adoption of reinforcement learning (RL) in many real-world decision-making problems. In the literature, the explanation is often provided by saliency attribution to the features of the RL agent's state. In this work, we propose a complementary approach to these explanations, particularly for offline RL, where we attribute the policy decisions of a trained RL agent to the trajectories it encountered during training. To do so, we encode trajectories in offline training data individually as well as collectively (encoding a set of trajectories). We then attribute policy decisions to a set of trajectories in this encoded space by estimating the sensitivity of the decision with respect to that set. Further, we demonstrate the effectiveness of the proposed approach in terms of quality of attributions as well as practical scalability in diverse environments that involve both discrete and continuous state and action spaces, such as grid-worlds, video games (Atari), and continuous control (MuJoCo). We also conduct a human study on a simple navigation task to observe how participants' understanding of the task compares with the data attributed for a trained RL policy.

AAAI Conference 2023 · Conference Paper

Smoothed Online Combinatorial Optimization Using Imperfect Predictions

  • Kai Wang
  • Zhao Song
  • Georgios Theocharous
  • Sridhar Mahadevan

Smoothed online combinatorial optimization considers a learner who repeatedly chooses a combinatorial decision to minimize an unknown changing cost function with a penalty on switching decisions in consecutive rounds. We study smoothed online combinatorial optimization problems when an imperfect predictive model is available, where the model can forecast the future cost functions with uncertainty. We show that using predictions to plan for a finite time horizon leads to regret dependent on the total predictive uncertainty and an additional switching cost. This observation suggests choosing a suitable planning window to balance between uncertainty and switching cost, which leads to an online algorithm with guarantees on the upper and lower bounds of the cumulative regret. Empirically, our algorithm shows a significant improvement in cumulative regret compared to other baselines in synthetic online distributed streaming problems.

AAAI Conference 2022 · Conference Paper

Constraint Sampling Reinforcement Learning: Incorporating Expertise for Faster Learning

  • Tong Mu
  • Georgios Theocharous
  • David Arbour
  • Emma Brunskill

Online reinforcement learning (RL) algorithms are often difficult to deploy in complex human-facing applications as they may learn slowly and have poor early performance. To address this, we introduce a practical algorithm for incorporating human insight to speed learning. Our algorithm, Constraint Sampling Reinforcement Learning (CSRL), incorporates prior domain knowledge as constraints/restrictions on the RL policy. It takes in multiple potential policy constraints to maintain robustness to misspecification of individual constraints while leveraging helpful ones to learn quickly. Given a base RL learning algorithm (e.g., UCRL, DQN, Rainbow), we propose an upper-confidence-with-elimination scheme that leverages the relationship between the constraints and their observed performance to adaptively switch among them. We instantiate our algorithm with DQN-type algorithms and UCRL as base algorithms, and evaluate it in four environments, including three simulators based on real data: recommendations, educational activity sequencing, and HIV treatment sequencing. In all cases, CSRL learns a good policy faster than baselines.
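
A toy sketch of the upper-confidence-with-elimination idea, treating each candidate constraint as a bandit arm with Bernoulli rewards. The constraint names, their values, and the confidence radius are illustrative assumptions, not the paper's environments.

```python
import math, random

random.seed(0)
# Hypothetical per-constraint policy values; "misspecified" hurts performance.
true_value = {"conservative": 0.4, "helpful": 0.7, "misspecified": 0.2}
stats = {name: [0, 0.0] for name in true_value}       # [pulls, mean reward]
active = set(stats)

def bounds(name, t):
    n, mean = stats[name]
    if n == 0:
        return float("inf"), float("-inf")            # force initial exploration
    radius = math.sqrt(2.0 * math.log(t) / n)
    return mean + radius, mean - radius

for t in range(1, 5001):
    choice = max(active, key=lambda a: bounds(a, t)[0])   # optimistic selection
    reward = 1.0 if random.random() < true_value[choice] else 0.0
    n, mean = stats[choice]
    stats[choice] = [n + 1, mean + (reward - mean) / (n + 1)]
    best_lcb = max(bounds(a, t)[1] for a in active)
    # Eliminate constraints whose upper bound falls below the best lower bound.
    active = {a for a in active if bounds(a, t)[0] >= best_lcb}

print("surviving constraints:", sorted(active))
```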

ICML Conference 2021 · Conference Paper

High Confidence Generalization for Reinforcement Learning

  • James E. Kostas
  • Yash Chandak
  • Scott M. Jordan
  • Georgios Theocharous
  • Philip S. Thomas

We present several classes of reinforcement learning algorithms that safely generalize to Markov decision processes (MDPs) not seen during training. Specifically, we study the setting in which some set of MDPs is accessible for training. The goal is to generalize safely to MDPs that are sampled from the same distribution, but which may not be in the set accessible for training. For various definitions of safety, our algorithms give probabilistic guarantees that agents can safely generalize to MDPs that are sampled from the same distribution but are not necessarily in the training set. These algorithms are a type of Seldonian algorithm (Thomas et al., 2019), which is a class of machine learning algorithms that return models with probabilistic safety guarantees for user-specified definitions of safety.

AAAI Conference 2020 · Conference Paper

Lifelong Learning with a Changing Action Set

  • Yash Chandak
  • Georgios Theocharous
  • Chris Nota
  • Philip Thomas

In many real-world sequential decision making problems, the number of available actions (decisions) can vary over time. While problems like catastrophic forgetting, changing transition dynamics, changing reward functions, etc. have been well-studied in the lifelong learning literature, the setting where the size of the action set changes remains unaddressed. In this paper, we present first steps towards developing an algorithm that autonomously adapts to an action set whose size changes over time. To tackle this open problem, we break it into two problems that can be solved iteratively: inferring the underlying, unknown structure in the space of actions, and optimizing a policy that leverages this structure. We demonstrate the efficiency of this approach on large-scale real-world lifelong learning problems.

ICML Conference 2020 · Conference Paper

Optimizing for the Future in Non-Stationary MDPs

  • Yash Chandak
  • Georgios Theocharous
  • Shiv Shankar
  • Martha White
  • Sridhar Mahadevan
  • Philip S. Thomas

Most reinforcement learning methods are based upon the key assumption that the transition dynamics and reward functions are fixed, that is, the underlying Markov decision process is stationary. However, in many real-world applications, this assumption is violated, and using existing algorithms may result in a performance lag. To proactively search for a good future policy, we present a policy gradient algorithm that maximizes a forecast of future performance. This forecast is obtained by fitting a curve to the counter-factual estimates of policy performance over time, without explicitly modeling the underlying non-stationarity. The resulting algorithm amounts to a non-uniform reweighting of past data, and we observe that minimizing performance over some of the data from past episodes can be beneficial when searching for a policy that maximizes future performance. We show that our algorithm, called Prognosticator, is more robust to non-stationarity than two online adaptation techniques, on three simulated problems motivated by real-world applications.
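
A minimal sketch of the forecasting step: fit a least-squares trend to per-episode counterfactual performance estimates and extrapolate one step ahead. The linear basis and synthetic drift are assumptions for illustration, not the paper's exact estimator.

```python
import numpy as np

rng = np.random.default_rng(1)
episodes = np.arange(50, dtype=float)
# Pretend these are per-episode importance-sampling estimates of the
# candidate policy's return in a slowly drifting environment.
perf_hat = 1.0 + 0.02 * episodes + rng.normal(0.0, 0.2, size=50)

# Least-squares fit of a degree-1 polynomial, then extrapolate one episode.
X = np.vander(episodes, N=2)                      # [t, 1] basis
coef, *_ = np.linalg.lstsq(X, perf_hat, rcond=None)
forecast = np.vander(np.array([50.0]), N=2) @ coef
print(f"forecasted next-episode performance: {forecast[0]:.3f}")
```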

AAAI Conference 2020 · Conference Paper

Reinforcement Learning When All Actions Are Not Always Available

  • Yash Chandak
  • Georgios Theocharous
  • Blossom Metevier
  • Philip Thomas

The Markov decision process (MDP) formulation used to model many real-world sequential decision making problems does not efficiently capture the setting where the set of available decisions (actions) at each time step is stochastic. Recently, the stochastic action set Markov decision process (SAS-MDP) formulation has been proposed, which better captures the concept of a stochastic action set. In this paper we argue that existing RL algorithms for SAS-MDPs can suffer from potential divergence issues, and present new policy gradient algorithms for SAS-MDPs that incorporate variance reduction techniques unique to this setting, and provide conditions for their convergence. We conclude with experiments that demonstrate the practicality of our approaches on tasks inspired by real-life use cases wherein the action set is stochastic.

NeurIPS Conference 2020 · Conference Paper

Towards Safe Policy Improvement for Non-Stationary MDPs

  • Yash Chandak
  • Scott Jordan
  • Georgios Theocharous
  • Martha White
  • Philip S. Thomas

Many real-world sequential decision-making problems involve critical systems with financial risks and human-life risks. While several works in the past have proposed methods that are safe for deployment, they assume that the underlying problem is stationary. However, many real-world problems of interest exhibit non-stationarity, and when stakes are high, the cost associated with a false stationarity assumption may be unacceptable. We take the first steps towards ensuring safety, with high confidence, for smoothly-varying non-stationary decision problems. Our proposed method extends a type of safe algorithm, called a Seldonian algorithm, through a synthesis of model-free reinforcement learning with time-series analysis. Safety is ensured using sequential hypothesis testing of a policy’s forecasted performance, and confidence intervals are obtained using wild bootstrap.
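
A minimal sketch of a wild-bootstrap confidence interval for a forecasted performance value: refit the trend on sign-flipped residuals and take quantiles. The linear basis, Rademacher multipliers, and synthetic data are illustrative assumptions; the paper couples such intervals with sequential hypothesis testing.

```python
import numpy as np

rng = np.random.default_rng(2)
t = np.arange(40, dtype=float)
y = 0.5 + 0.01 * t + rng.normal(0.0, 0.1, 40)    # per-episode performance estimates

X = np.vander(t, N=2)                            # [t, 1] basis
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ coef
x_future = np.vander(np.array([40.0]), N=2)      # next episode's features

boots = []
for _ in range(2000):
    sigma = rng.choice([-1.0, 1.0], size=40)     # Rademacher multipliers
    y_star = X @ coef + sigma * resid            # wild-bootstrap pseudo-sample
    c_star, *_ = np.linalg.lstsq(X, y_star, rcond=None)
    boots.append((x_future @ c_star)[0])

lo, hi = np.percentile(boots, [2.5, 97.5])
print(f"95% CI for forecasted performance: [{lo:.3f}, {hi:.3f}]")
```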

ICML Conference 2019 · Conference Paper

Learning Action Representations for Reinforcement Learning

  • Yash Chandak
  • Georgios Theocharous
  • James E. Kostas
  • Scott M. Jordan
  • Philip S. Thomas

Most model-free reinforcement learning methods leverage state representations (embeddings) for generalization, but either ignore structure in the space of actions or assume the structure is provided a priori. We show how a policy can be decomposed into a component that acts in a low-dimensional space of action representations and a component that transforms these representations into actual actions. These representations improve generalization over large, finite action sets by allowing the agent to infer the outcomes of actions similar to actions already taken. We provide an algorithm to both learn and use action representations and provide conditions for its convergence. The efficacy of the proposed method is demonstrated on large-scale real-world problems.
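
A minimal sketch of the decomposition: an internal policy maps a state to a point in a low-dimensional representation space, and a second component turns that point into a concrete action (here by nearest neighbor). The random embeddings and linear policy are stand-ins; the paper learns both components from data.

```python
import numpy as np

rng = np.random.default_rng(3)
n_actions, d, state_dim = 1000, 8, 4
action_embeddings = rng.normal(size=(n_actions, d))  # learned in the paper
W = rng.normal(size=(d, state_dim))                  # toy internal-policy weights

def internal_policy(state: np.ndarray) -> np.ndarray:
    """Map a state to a point in the action-representation space."""
    return np.tanh(W @ state)

def to_action(embedding: np.ndarray) -> int:
    """Pick the concrete action whose representation is closest."""
    return int(np.argmin(np.linalg.norm(action_embeddings - embedding, axis=1)))

state = rng.normal(size=state_dim)
print("chosen action:", to_action(internal_policy(state)))
```

Generalization over large action sets comes from the embedding geometry: actions with nearby representations are treated as having similar outcomes.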

AAMAS Conference 2018 · Conference Paper

Capacity-aware Sequential Recommendations

  • Frits de Nijs
  • Georgios Theocharous
  • Nikos Vlassis
  • Mathijs M. de Weerdt
  • Matthijs T. J. Spaan

Personalized recommendations are increasingly important to engage users and guide them through large systems, for example when recommending points of interest to tourists visiting a popular city. To maximize long-term user experience, the system should consider issuing recommendations sequentially, since by observing the user’s response to a recommendation, the system can update its estimate of the user’s (latent) interests. However, as traditional recommender systems target individuals, their effect on a collective of users can unintentionally overload capacity. Therefore, recommender systems should not only consider the users’ interests, but also the effect of recommendations on the available capacity. The structure in such a constrained, multi-agent, partially observable decision problem can be exploited by a novel belief-space sampling algorithm which bounds the size of the state space by a limit on regret. By exploiting the stationary structure of the problem, our algorithm is significantly more scalable than existing approximate solvers. Moreover, by explicitly considering the information value of actions, this algorithm significantly improves the quality of recommendations over an extension of posterior sampling reinforcement learning to the constrained multi-agent case. We show how to decouple constraint satisfaction from sequential recommendation policies, resulting in algorithms which issue recommendations to thousands of agents while respecting constraints.

NeurIPS Conference 2018 · Conference Paper

Scalar Posterior Sampling with Applications

  • Georgios Theocharous
  • Zheng Wen
  • Yasin Abbasi Yadkori
  • Nikos Vlassis

We propose a practical non-episodic PSRL algorithm that, unlike recent state-of-the-art PSRL algorithms, uses a deterministic, model-independent episode-switching schedule. Our algorithm, termed deterministic-schedule PSRL (DS-PSRL), is efficient in terms of time, sample, and space complexity. We prove a Bayesian regret bound under mild assumptions. Our result is more generally applicable to multiple parameters and continuous state-action problems. We compare our algorithm with state-of-the-art PSRL algorithms on standard discrete and continuous problems from the literature. Finally, we show how the assumptions of our algorithm satisfy a sensible parameterization for a large class of problems in sequential recommendations.
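
A toy sketch of the deterministic switching schedule: resample a model from the posterior only at steps 1, 2, 4, 8, ... and act greedily on the sample in between. The two-armed Bernoulli "MDP" and Beta posteriors are illustrative stand-ins for the paper's setting.

```python
import random

random.seed(0)
true_p = [0.3, 0.6]                 # success probability of two actions
alpha = [1, 1]; beta = [1, 1]       # Beta posterior parameters per action

sampled = None
next_switch = 1
for t in range(1, 1025):
    if t == next_switch:            # deterministic doubling schedule
        sampled = [random.betavariate(alpha[a], beta[a]) for a in (0, 1)]
        next_switch *= 2
    a = max((0, 1), key=lambda i: sampled[i])   # greedy w.r.t. the sample
    r = 1 if random.random() < true_p[a] else 0
    alpha[a] += r; beta[a] += 1 - r

print("posterior means:", [round(alpha[a] / (alpha[a] + beta[a]), 2) for a in (0, 1)])
```

The schedule depends only on the step count, not on the model, which is what makes it deterministic and model-independent.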

ICML Conference 2015 · Conference Paper

High Confidence Policy Improvement

  • Philip S. Thomas
  • Georgios Theocharous
  • Mohammad Ghavamzadeh

We present a batch reinforcement learning (RL) algorithm that provides probabilistic guarantees about the quality of each policy that it proposes, and which has no hyper-parameter that requires expert tuning. Specifically, the user may select any performance lower-bound and confidence level and our algorithm will ensure that the probability that it returns a policy with performance below the lower bound is at most the specified confidence level. We then propose an incremental algorithm that executes our policy improvement algorithm repeatedly to generate multiple policy improvements. We show the viability of our approach with a simple 4×4 gridworld and the standard mountain car problem, as well as with a digital marketing application that uses real world data.

AAAI Conference 2015 · Conference Paper

High-Confidence Off-Policy Evaluation

  • Philip Thomas
  • Georgios Theocharous
  • Mohammad Ghavamzadeh

Many reinforcement learning algorithms use trajectories collected from the execution of one or more policies to propose a new policy. Because execution of a bad policy can be costly or dangerous, techniques for evaluating the performance of the new policy without requiring its execution have been of recent interest in industry. Such off-policy evaluation methods, which estimate the performance of a policy using trajectories collected from the execution of other policies, heretofore have not provided confidences regarding the accuracy of their estimates. In this paper we propose an off-policy method for computing a lower confidence bound on the expected return of a policy.
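
A minimal sketch of the idea: form per-trajectory importance-weighted returns and apply a concentration inequality to get a lower confidence bound. Here the weights are clipped (a downward-biased, hence safe, simplification) and a Hoeffding bound is used; the paper develops a tighter inequality, and the data below are synthetic.

```python
import numpy as np

rng = np.random.default_rng(4)
n, c, delta = 5000, 10.0, 0.05
returns = rng.uniform(0.0, 1.0, n)                   # trajectory returns in [0, 1]
weights = np.minimum(rng.lognormal(0.0, 0.5, n), c)  # IS weights, clipped at c

# Clipping can only lower the estimate, so the lower bound remains valid.
est = weights * returns                              # each term lies in [0, c]
lower = est.mean() - c * np.sqrt(np.log(1.0 / delta) / (2.0 * n))
print(f"with prob. >= {1 - delta}, true performance >= {lower:.3f}")
```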

IJCAI Conference 2015 · Conference Paper

Personalized Ad Recommendation Systems for Life-Time Value Optimization with Guarantees

  • Georgios Theocharous
  • Philip S. Thomas
  • Mohammad Ghavamzadeh

In this paper, we propose a framework for using reinforcement learning (RL) algorithms to learn good policies for personalized ad recommendation (PAR) systems. The RL algorithms take into account the long-term effect of an action, and thus, could be more suitable than myopic techniques like supervised learning and contextual bandit, for modern PAR systems in which the number of returning visitors is rapidly growing. However, while myopic techniques have been well-studied in PAR systems, the RL approach is still in its infancy, mainly due to two fundamental challenges: how to compute a good RL strategy and how to evaluate a solution using historical data to ensure its “safety” before deployment. In this paper, we propose to use a family of off-policy evaluation techniques with statistical guarantees to tackle both these challenges. We apply these methods to a real PAR problem, both for evaluating the final performance and for optimizing the parameters of the RL algorithm. Our results show that an RL algorithm equipped with these off-policy evaluation techniques outperforms the myopic approaches. Our results also give fundamental insights on the difference between the click through rate (CTR) and life-time value (LTV) metrics for evaluating the performance of a PAR algorithm.

NeurIPS Conference 2015 · Conference Paper

Policy Evaluation Using the Ω-Return

  • Philip Thomas
  • Scott Niekum
  • Georgios Theocharous
  • George Konidaris

We propose the Ω-return as an alternative to the λ-return currently used by the TD(λ) family of algorithms. The benefit of the Ω-return is that it accounts for the correlation of different length returns. Because it is difficult to compute exactly, we suggest one way of approximating the Ω-return. We provide empirical studies that suggest that it is superior to the λ-return and γ-return for a variety of problems.
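
For reference, a minimal sketch of the finite-horizon λ-return that the Ω-return is positioned against: an exponentially weighted average of n-step returns. The rewards and value estimates below are toy numbers; the Ω-return would additionally account for the correlations between the different-length returns.

```python
import numpy as np

gamma, lam = 0.99, 0.9
rewards = np.array([0.0, 0.0, 1.0, 0.0, 2.0])
values = np.array([0.5, 0.6, 0.4, 0.7, 0.3, 0.0])   # V(s_0..s_5), terminal = 0

def lambda_return(t: int) -> float:
    """Weighted average of n-step returns: (1-λ)λ^(n-1) each, residual on the last."""
    T = len(rewards)
    ret = 0.0
    for n in range(1, T - t + 1):
        g_n = sum(gamma**k * rewards[t + k] for k in range(n)) + gamma**n * values[t + n]
        weight = (1 - lam) * lam ** (n - 1) if n < T - t else lam ** (n - 1)
        ret += weight * g_n
    return ret

print(f"λ-return from t=0: {lambda_return(0):.3f}")
```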

RLDM Conference 2013 · Conference Abstract

Lifetime Value Marketing using Reinforcement Learning

  • Georgios Theocharous
  • Assaf Hallak

In many marketing applications, companies use technology for interacting with their customers and making product or service recommendations. Today, these marketing decisions are mainly made in a myopic (best opportunity right now) approach and optimize short-term gains. In our research we are exploring new ways of marketing interactions for optimizing Life-Time Value (LTV). In particular, we are exploring marketing recommendations through Reinforcement Learning (RL) and Markov Decision Processes (MDPs). In this paper we compute the LTV policies for several real world data sets using various state-of-the-art reinforcement learning algorithms. In addition, we propose an offline evaluation method based on a well-crafted simulator, under which LTV policies outperform myopic policies. Finally, we characterize the error of the estimated value of the policies on the simulator, using the simulator’s prediction errors.

AAAI Conference 2013 · Conference Paper

Structured Kernel-Based Reinforcement Learning

  • Branislav Kveton
  • Georgios Theocharous

Kernel-based reinforcement learning (KBRL) is a popular approach to learning non-parametric value function approximations. In this paper, we present structured KBRL, a paradigm for kernel-based RL that allows for modeling independencies in the transition and reward models of problems. Real-world problems often exhibit this structure and can be solved more efficiently when it is modeled. We make three contributions. First, we motivate our work, define a structured backup operator, and prove that it is a contraction. Second, we show how to evaluate our operator efficiently. Our analysis reveals that the fixed point of the operator is the optimal value function in a special factored MDP. Finally, we evaluate our method on a synthetic problem and compare it to two KBRL baselines. In most experiments, we learn better policies than the baselines from an order of magnitude less training data.

AAAI Conference 2012 · Conference Paper

Kernel-Based Reinforcement Learning on Representative States

  • Branislav Kveton
  • Georgios Theocharous

Markov decision processes (MDPs) are an established framework for solving sequential decision-making problems under uncertainty. In this work, we propose a new method for batch-mode reinforcement learning (RL) with continuous state variables. The method is an approximation to kernel-based RL on a set of k representative states. Similarly to kernel-based RL, our solution is a fixed point of a kernelized Bellman operator and can approximate the optimal solution to an arbitrary level of granularity. Unlike kernel-based RL, our method is fast. In particular, our policies can be computed in O(n) time, where n is the number of training examples. The time complexity of kernel-based RL is Ω(n²). We introduce our method, analyze its convergence, and compare it to existing work. The method is evaluated on two existing control problems with 2 to 4 continuous variables and a new problem with 64 variables. In all cases, we outperform state-of-the-art results and offer simpler solutions.
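
A toy sketch of a kernelized backup on representative states (single-action case for brevity): observed transitions are smoothed onto k representative states with a Gaussian kernel, and value iteration runs on the resulting k×k model. The data, kernel width, and reward are illustrative assumptions; the paper's operator and analysis are more involved.

```python
import numpy as np

rng = np.random.default_rng(5)
k, n, gamma = 20, 200, 0.95
reps = np.linspace(0.0, 1.0, k)                      # representative states
s = rng.uniform(0.0, 1.0, n)                         # sampled transitions s -> s'
s_next = np.clip(s + rng.normal(0.0, 0.1, n), 0.0, 1.0)
r = s_next                                           # toy reward: reach high states

def kernel(a, b, h=0.05):
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * h * h))

K = kernel(reps, s)                                  # (k, n): reps -> samples
K /= K.sum(axis=1, keepdims=True)
Phi = kernel(s_next, reps)                           # (n, k): s' back onto reps
Phi /= Phi.sum(axis=1, keepdims=True)

P = K @ Phi                                          # (k, k) smoothed transitions
R = K @ r                                            # (k,) smoothed rewards
V = np.zeros(k)
for _ in range(200):                                 # value iteration on reps only
    V = R + gamma * P @ V
print("V at first representative states:", np.round(V[:5], 2))
```

Keeping the backup on k representative states rather than all n samples is what makes the per-step cost linear in n instead of quadratic.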

AAAI Conference 2010 · Conference Paper

Compressing POMDPs Using Locality Preserving Non-Negative Matrix Factorization

  • Georgios Theocharous
  • Sridhar Mahadevan

Partially Observable Markov Decision Processes (POMDPs) are a well-established and rigorous framework for sequential decision-making under uncertainty. POMDPs are well-known to be intractable to solve exactly, and there has been significant work on finding tractable approximation methods. One well-studied approach is to find a compression of the original POMDP by projecting the belief states to a lower-dimensional space. We present a novel dimensionality reduction method for POMDPs based on locality preserving non-negative matrix factorization. Unlike previous approaches, such as Krylov compression and regular non-negative matrix factorization, our approach preserves the local geometry of the belief space manifold. We present results on standard benchmark POMDPs showing improved performance over previously explored compression algorithms for POMDPs.
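
A minimal sketch of the belief-compression step, using plain NMF from scikit-learn as a stand-in: factor a matrix of sampled beliefs B ≈ WH and use the rows of W as low-dimensional beliefs. The sampled beliefs are synthetic, and the paper's factorization adds a locality-preserving term so nearby beliefs remain nearby after projection.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(6)
beliefs = rng.dirichlet(np.ones(50), size=500)   # 500 beliefs over 50 states

nmf = NMF(n_components=5, init="nndsvda", max_iter=500, random_state=0)
W = nmf.fit_transform(beliefs)                   # compressed beliefs, shape (500, 5)
H = nmf.components_                              # non-negative basis over states
print("reconstruction error:", round(nmf.reconstruction_err_, 3))
```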

AAAI Conference 2008 · Conference Paper

Online Learning with Expert Advice and Finite-Horizon Constraints

  • Branislav Kveton
  • Georgios Theocharous

In this paper, we study a sequential decision making problem. The objective is to maximize the average reward accumulated over time subject to temporal cost constraints. The novelty of our setup is that the rewards and constraints are controlled by an adverse opponent. To solve our problem in a practical way, we propose an expert algorithm that guarantees both a vanishing regret and a sublinear number of violated constraints. The quality of this solution is demonstrated on a real-world power management problem. Our results support the hypothesis that online learning with convex cost constraints can be performed successfully in practice.

ICRA Conference 2004 · Conference Paper

Representing Hierarchical POMDPs as DBNs for Multi-scale Robot Localization

  • Georgios Theocharous
  • Kevin Murphy 0002
  • Leslie Pack Kaelbling

We explore the advantages of representing hierarchical partially observable Markov decision processes (H-POMDPs) as dynamic Bayesian networks (DBNs). In particular, we focus on the special case of using H-POMDPs to represent multi-resolution spatial maps for indoor robot navigation. Our results show that a DBN representation of H-POMDPs can train significantly faster than the original learning algorithm for H-POMDPs or the equivalent flat POMDP, and requires much less data. In addition, the DBN formulation can easily be extended to parameter tying and factoring of variables, which further reduces the time and sample complexity. This enables us to apply H-POMDP methods to much larger problems than previously possible.

ICRA Conference 2002 · Conference Paper

Approximate Planning with Hierarchical Partially Observable Markov Decision Process Models for Robot Navigation

  • Georgios Theocharous
  • Sridhar Mahadevan

We propose and investigate a planning framework based on the hierarchical partially observable Markov decision process model (HPOMDP), and apply it to robot navigation. We show how this framework can be used to produce more robust plans as compared to flat models such as partially observable Markov decision processes (POMDPs). In our approach the environment is modeled at different levels of resolution, where abstract states represent both spatial and temporal abstraction. We test our hierarchical POMDP approach using a large simulated and real navigation environment. The results show that the robot is more successful in navigating to goals starting with no positional knowledge (uniform initial belief state distribution) using the hierarchical POMDP framework as compared to the flat POMDP approach.

IROS Conference 2002 · Conference Paper

Learning the hierarchical structure of spatial environments using multiresolution statistical models

  • Georgios Theocharous
  • Sridhar Mahadevan

We explore the use of hierarchical Partially Observable Markov Decision Process (HPOMDP) models to represent and learn a multiresolution spatial structure representation of indoor office environments. The hierarchical POMDP model is based on the hierarchical Hidden Markov Model (HHMM). HPOMDPs can be learned from sequences of observations using an extension of the hierarchical Baum-Welch estimation algorithm for HHMMs. We apply the HPOMDP model to indoor robot navigation and show how this framework can be used to represent multiresolution spatial maps. In the HPOMDP framework the environment is modeled at different levels of resolution, where abstract states represent both spatial and temporal abstraction. We test our hierarchical POMDP approach using a large simulated navigation environment (modeled after a real environment). The results show that the hierarchical POMDP model is more capable of inferring the spatial structure than a uniform-resolution "flat" POMDP.

ICRA Conference 2001 · Conference Paper

Learning Hierarchical Partially Observable Markov Decision Process Models for Robot Navigation

  • Georgios Theocharous
  • Khashayar Rohanimanesh
  • Sridhar Mahadevan

We propose and investigate a general framework for hierarchical modeling of partially observable environments, such as office buildings, using hierarchical hidden Markov models (HHMMs). Our main goal is to explore hierarchical modeling as a basis for designing more efficient methods for model construction and usage. As a case study we focus on indoor robot navigation and show how this framework can be used to learn a hierarchy of models of the environment at different levels of spatial abstraction. We introduce the idea of model reuse, which can be used to combine already learned models into a larger model. We describe an extension of the HHMM model to include actions, which we call hierarchical POMDPs, and describe a modified hierarchical Baum-Welch algorithm to learn these models. We train different families of hierarchical models for a simulated and a real world corridor environment and compare them with the standard "flat" representation of the same environment. We show that the hierarchical POMDP approach, combined with model reuse, allows learning hierarchical models that fit the data better and train faster than flat models.