Arrow Research

Author name cluster

Shie Mannor

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

196 papers
2 author rows

Possible papers (196)

ICML Conference 2025 Conference Paper

A Classification View on Meta Learning Bandits

  • Mirco Mutti
  • Jeongyeol Kwon
  • Shie Mannor
  • Aviv Tamar

Contextual multi-armed bandits are a popular choice to model sequential decision-making. E.g., in a healthcare application we may perform various tests to assess a patient's condition (exploration) and then decide on the best treatment to give (exploitation). When humans design strategies, they aim for the exploration to be fast, since the patient's health is at stake, and easy to interpret for a physician overseeing the process. However, common bandit algorithms are nothing like that: the regret caused by exploration scales with $\sqrt{H}$ over $H$ rounds, and decision strategies are based on opaque statistical considerations. In this paper, we use an original classification view to meta learn interpretable and fast exploration plans for a fixed collection of bandits $\mathbb{M}$. The plan is prescribed by an interpretable decision tree probing decisions' payoff to classify the test bandit. The test regret of the plan in the stochastic and contextual setting scales with $O(\lambda^{-2} C_{\lambda}(\mathbb{M}) \log^2(MH))$, where $M$ is the size of $\mathbb{M}$, $\lambda$ is a separation parameter over the bandits, and $C_\lambda(\mathbb{M})$ is a novel classification coefficient that fundamentally links meta learning bandits with classification. Through a nearly matching lower bound, we show that $C_\lambda(\mathbb{M})$ inherently captures the complexity of the setting.
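
For intuition, a minimal sketch of such an exploration plan (not the paper's code; the two-bandit collection, probe arm, threshold, and horizon below are invented for illustration): a single decision node probes one arm, thresholds its empirical mean to classify the test bandit, and then commits to the classified instance's best arm.

```python
# Illustrative decision-tree exploration plan over a hypothetical
# collection M of two Bernoulli bandits, separated on arm 0.
import numpy as np

rng = np.random.default_rng(0)

M = {"bandit_A": np.array([0.8, 0.4]),
     "bandit_B": np.array([0.2, 0.7])}

def pull(true_means, arm):
    return rng.binomial(1, true_means[arm])

def tree_plan(true_means, n_probe=20, horizon=1000):
    # Explore: probe arm 0, whose mean separates the two instances.
    probe_mean = np.mean([pull(true_means, 0) for _ in range(n_probe)])
    # Classify: a single interpretable threshold test.
    label = "bandit_A" if probe_mean > 0.5 else "bandit_B"
    best_arm = int(np.argmax(M[label]))
    # Exploit: commit to the classified instance's best arm.
    total = sum(pull(true_means, best_arm) for _ in range(horizon - n_probe))
    return label, total

print(tree_plan(M["bandit_A"]))
```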

EWRL Workshop 2025 Workshop Paper

A Classification View on Meta Learning Bandits

  • Mirco Mutti
  • Jeongyeol Kwon
  • Shie Mannor
  • Aviv Tamar

Contextual multi-armed bandits are a popular choice to model sequential decision-making. E.g., in a healthcare application we may perform various tests to assess a patient's condition (exploration) and then decide on the best treatment to give (exploitation). When humans design strategies, they aim for the exploration to be fast, since the patient's health is at stake, and easy to interpret for a physician overseeing the process. However, common bandit algorithms are nothing like that: the regret caused by exploration scales with $\sqrt{H}$ over $H$ rounds, and decision strategies are based on opaque statistical considerations. In this paper, we use an original classification view to meta learn interpretable and fast exploration plans for a fixed collection of bandits $\mathbb{M}$. The plan is prescribed by an interpretable decision tree probing decisions' payoff to classify the test bandit. The test regret of the plan in the stochastic and contextual setting scales with $O(\lambda^{-2} C_{\lambda}(\mathbb{M}) \log^2(MH))$, where $M$ is the size of $\mathbb{M}$, $\lambda$ is a separation parameter over the bandits, and $C_\lambda(\mathbb{M})$ is a novel classification coefficient that fundamentally links meta learning bandits with classification. Through a nearly matching lower bound, we show that $C_\lambda(\mathbb{M})$ inherently captures the complexity of the setting.

NeurIPS Conference 2025 Conference Paper

Efficient Fairness-Performance Pareto Front Computation

  • Mark Kozdoba
  • Binyamin Perets
  • Shie Mannor

There is a well-known intrinsic trade-off between the fairness of a representation and the performance of classifiers derived from the representation. In this paper we propose a new method to compute the optimal Pareto front of this trade-off. In contrast to existing methods, this approach does not require the training of complex fair representation models. Our approach is derived through three main steps: We analyze fair representations theoretically and derive several structural properties of optimal representations. We then show that these properties enable a reduction of the computation of the Pareto front to a compact discrete problem. Finally, we show that these compact approximating problems can be efficiently solved via off-the-shelf concave-convex programming methods. In addition to representations, we show that the new methods may also be used to directly compute the Pareto front of fair classification problems. Moreover, the proposed methods may be used with any concave performance measure. This is in contrast to the existing reduction approaches, developed recently in fair classification, which rely explicitly on the structure of the non-differentiable accuracy measure and are thus unlikely to be extendable. The approach was evaluated on several real-world benchmark datasets and compares favorably to a number of recent state-of-the-art fair representation and classification methods.

ICLR Conference 2025 Conference Paper

Global Convergence of Policy Gradient in Average Reward MDPs

  • Navdeep Kumar
  • Yashaswini Murthy
  • Itai Shufaro
  • Kfir Yehuda Levy
  • R. Srikant 0001
  • Shie Mannor

We present the first comprehensive finite-time global convergence analysis of policy gradient for infinite-horizon average-reward Markov decision processes (MDPs). Specifically, we focus on ergodic tabular MDPs with finite state and action spaces. Our analysis shows that the policy gradient iterates converge to the optimal policy at a sublinear rate of $O(\frac{1}{T})$, where $T$ represents the number of iterations. Performance bounds for discounted-reward MDPs cannot be easily extended to average-reward MDPs, as the bounds grow in proportion to the fifth power of the effective horizon. Recent work on such extensions makes a smoothness assumption that has not been verified. Thus, our primary contribution is in providing the first complete proof that the policy gradient algorithm converges globally for average-reward MDPs, without such an assumption. We also obtain the corresponding finite-time performance guarantees. In contrast to the existing discounted-reward performance bounds, our performance bounds have an explicit dependence on constants that capture the complexity of the underlying MDP. Motivated by this observation, we reexamine and improve the existing performance bounds for discounted-reward MDPs. We also present simulations that empirically validate the result.
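
A toy rendering of the analyzed scheme, under stated assumptions (a 2-state/2-action ergodic MDP, a tabular softmax policy, and finite-difference gradients standing in for the exact policy gradient): gradient ascent on the average reward computed through the stationary distribution.

```python
# Sketch: policy gradient ascent on the average reward rho(theta) of a
# toy ergodic MDP; all constants are invented for illustration.
import numpy as np

P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s']
              [[0.7, 0.3], [0.05, 0.95]]])
R = np.array([[1.0, 0.0], [0.5, 2.0]])    # R[s, a]

def avg_reward(theta):
    pi = np.exp(theta) / np.exp(theta).sum(axis=1, keepdims=True)
    P_pi = np.einsum("sa,sab->sb", pi, P)           # transitions under pi
    evals, evecs = np.linalg.eig(P_pi.T)
    mu = np.real(evecs[:, np.argmax(np.real(evals))])
    mu /= mu.sum()                                   # stationary distribution
    return float(mu @ (pi * R).sum(axis=1))

theta = np.zeros((2, 2))
for t in range(200):
    grad = np.zeros_like(theta)
    for idx in np.ndindex(theta.shape):              # finite-difference gradient
        e = np.zeros_like(theta); e[idx] = 1e-5
        grad[idx] = (avg_reward(theta + e) - avg_reward(theta - e)) / 2e-5
    theta += 1.0 * grad
print("average reward after ascent:", avg_reward(theta))
```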

NeurIPS Conference 2025 Conference Paper

Non-rectangular Robust MDPs with Normed Uncertainty Sets

  • Navdeep Kumar
  • Adarsh Gupta
  • Maxence Mohamed Elfatihi
  • Giorgia Ramponi
  • Kfir Y. Levy
  • Shie Mannor

Robust policy evaluation for non-rectangular uncertainty sets is generally NP-hard, even in approximation. Consequently, existing approaches suffer from either exponential iteration complexity or significant accuracy gaps. Interestingly, we identify a powerful class of $L_p$-bounded uncertainty sets that avoids these complexity barriers due to its structural simplicity. We further show that this class can be decomposed into infinitely many \texttt{sa}-rectangular $L_p$-bounded sets and leverage its structural properties to derive a novel dual formulation for $L_p$ robust Markov Decision Processes (MDPs). This formulation reveals key insights into the adversary’s strategy and leads to the \textbf{first polynomial-time robust policy evaluation algorithm} for $L_1$-normed non-rectangular robust MDPs.

EWRL Workshop 2025 Workshop Paper

Non-rectangular Robust MDPs with Normed Uncertainty Sets

  • Navdeep Kumar
  • Adarsh Gupta
  • Maxence Mohamed Elfatihi
  • Giorgia Ramponi
  • Kfir Yehuda Levy
  • Shie Mannor

Robust policy evaluation for non-rectangular uncertainty sets is generally NP-hard, even in approximation. Consequently, existing approaches suffer from either exponential iteration complexity or significant accuracy gaps. Interestingly, we identify a powerful class of $L_p$-bounded uncertainty sets that avoids these complexity barriers due to its structural simplicity. We further show that this class can be decomposed into infinitely many \texttt{sa}-rectangular $L_p$-bounded sets and leverage its structural properties to derive a novel dual formulation for $L_p$ robust Markov Decision Processes (MDPs). This formulation provides key insights into the adversary’s strategy and enables the development of an efficient robust policy evaluation algorithm for these $L_p$-normed non-rectangular robust MDPs.

ICLR Conference 2025 Conference Paper

On Bits and Bandits: Quantifying the Regret-Information Trade-off

  • Itai Shufaro
  • Nadav Merlis
  • Nir Weinberger
  • Shie Mannor

In many sequential decision problems, an agent performs a repeated task. It then suffers regret and obtains information that it may use in the following rounds. However, sometimes the agent may also obtain information and avoid suffering regret by querying external sources. We study the trade-off between the information an agent accumulates and the regret it suffers. We invoke information-theoretic methods for obtaining regret lower bounds, which also allow us to easily re-derive several known lower bounds. We introduce the first Bayesian regret lower bounds that depend on the information an agent accumulates. We also prove regret upper bounds using the amount of information the agent accumulates. These bounds show that information, measured in bits, can be traded off for regret, measured in reward. Finally, we demonstrate the utility of these bounds in improving the performance of a question-answering task with large language models, allowing us to obtain valuable insights.

NeurIPS Conference 2025 Conference Paper

On the Convergence of Single-Timescale Actor-Critic

  • Navdeep Kumar
  • Priyank Agrawal
  • Giorgia Ramponi
  • Kfir Y. Levy
  • Shie Mannor

We analyze the global convergence of the single-timescale actor-critic (AC) algorithm for infinite-horizon discounted Markov decision processes (MDPs) with finite state spaces. To this end, we introduce an elegant analytical framework for handling the complex, coupled recursions inherent in the algorithm. Leveraging this framework, we establish that the algorithm converges to an $\epsilon$-close \textbf{globally optimal} policy with a sample complexity of $O(\epsilon^{-3})$. This significantly improves upon the existing complexity of $O(\epsilon^{-2})$ to achieve an $\epsilon$-close \textbf{stationary policy}, which is equivalent to a complexity of $O(\epsilon^{-4})$ to achieve an $\epsilon$-close \textbf{globally optimal} policy using the gradient domination lemma. Furthermore, we demonstrate that to achieve this improvement, the step sizes for both the actor and critic must decay as $O(k^{-\frac{2}{3}})$ with iteration $k$, diverging from the conventional $O(k^{-\frac{1}{2}})$ rates commonly used in (non)convex optimization.
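
A minimal single-timescale loop illustrating the step-size point (the MDP, constants, and learning-rate scales below are assumptions, not the paper's): both the actor and the critic are updated every iteration with the same $O(k^{-2/3})$ decay.

```python
# Sketch of a single-timescale actor-critic on a toy 2-state/2-action
# discounted MDP; both updates share the k^{-2/3} decay rate.
import numpy as np

rng = np.random.default_rng(1)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.05, 0.95]]])
R = np.array([[1.0, 0.0], [0.5, 2.0]])
gamma = 0.9

theta = np.zeros((2, 2))   # actor: softmax policy logits
V = np.zeros(2)            # critic: tabular value estimate
s = 0
for k in range(1, 50_000):
    lr = k ** (-2.0 / 3.0)               # single timescale: same decay for both
    pi = np.exp(theta[s]) / np.exp(theta[s]).sum()
    a = rng.choice(2, p=pi)
    s2 = rng.choice(2, p=P[s, a])
    td = R[s, a] + gamma * V[s2] - V[s]  # TD error from the current critic
    V[s] += 0.5 * lr * td                # critic update
    grad_log = -pi; grad_log[a] += 1.0   # grad of log pi(a|s) for softmax
    theta[s] += 0.1 * lr * td * grad_log # actor update
    s = s2
print(np.exp(theta) / np.exp(theta).sum(axis=1, keepdims=True))
```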

EWRL Workshop 2025 Workshop Paper

On the Convergence of Single-Timescale Actor-Critic

  • Navdeep Kumar
  • Priyank Agrawal
  • Giorgia Ramponi
  • Kfir Yehuda Levy
  • Shie Mannor

We analyze the global convergence of the single-timescale actor-critic (AC) algorithm for infinite-horizon discounted Markov decision processes (MDPs) with finite state spaces. To this end, we introduce an elegant analytical framework for handling the complex, coupled recursions inherent in the algorithm. Leveraging this framework, we establish that the algorithm converges to an $\epsilon$-close \textbf{globally optimal} policy with a sample complexity of $O(\epsilon^{-3})$. This significantly improves upon the existing complexity of $O(\epsilon^{-2})$ to achieve an $\epsilon$-close \textbf{stationary policy}, which is equivalent to a complexity of $O(\epsilon^{-4})$ to achieve an $\epsilon$-close \textbf{globally optimal} policy using the gradient domination lemma. Furthermore, we demonstrate that to achieve this improvement, the step sizes for both the actor and critic must decay as $O(k^{-\frac{2}{3}})$ with iteration $k$, diverging from the conventional $O(k^{-\frac{1}{2}})$ rates commonly used in (non)convex optimization.

ICML Conference 2025 Conference Paper

Policy Gradient with Tree Expansion

  • Gal Dalal
  • Assaf Hallak
  • Gugan Thoppe
  • Shie Mannor
  • Gal Chechik

Policy gradient methods are notorious for having a large variance and high sample complexity. To mitigate this, we introduce SoftTreeMax—a generalization of softmax that employs planning. In SoftTreeMax, we extend the traditional logits with the multi-step discounted cumulative reward, topped with the logits of future states. We analyze SoftTreeMax and explain how tree expansion helps to reduce its gradient variance. We prove that the variance depends on the chosen tree-expansion policy. Specifically, we show that the closer the induced transitions are to being state-independent, the stronger the variance decay. With approximate forward models, we prove that the resulting gradient bias diminishes with the approximation error while retaining the same variance reduction. Ours is the first result to bound the gradient bias for an approximate model. In a practical implementation of SoftTreeMax we utilize a parallel GPU-based simulator for fast and efficient tree expansion. Using this implementation in Atari, we show that SoftTreeMax reduces the gradient variance by three orders of magnitude. This leads to better sample complexity and improved performance compared to distributed PPO.
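
A hedged sketch of the SoftTreeMax scoring rule on a toy deterministic model (the environment, depth, and logits are invented; the paper's GPU-based implementation differs): each root action is scored by a log-sum-exp over tree paths of discounted cumulative rewards topped with leaf-state logits.

```python
# Illustrative SoftTreeMax: softmax over tree-expanded action scores.
import numpy as np

def soft_tree_max(root, actions, step, reward, logit, depth, gamma=0.99):
    def path_values(s, d, acc, disc):
        # Enumerate all depth-d continuations from s, accumulating
        # discounted rewards; top each path with the leaf-state logit.
        if d == 0:
            return [acc + disc * logit(s)]
        vals = []
        for a in actions:
            vals += path_values(step(s, a), d - 1,
                                acc + disc * reward(s, a), disc * gamma)
        return vals

    scores = []
    for a in actions:
        vals = np.array(path_values(step(root, a), depth - 1,
                                    reward(root, a), gamma))
        scores.append(np.log(np.exp(vals).sum()))   # log-sum-exp aggregation
    scores = np.array(scores)
    z = np.exp(scores - scores.max())
    return z / z.sum()

# Toy deterministic chain: two actions move +/-1, reward = -|s|.
policy = soft_tree_max(0, [0, 1],
                       step=lambda s, a: s + (1 if a else -1),
                       reward=lambda s, a: -abs(s),
                       logit=lambda s: 0.0, depth=3)
print(policy)
```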

NeurIPS Conference 2025 Conference Paper

Policy Optimized Text-to-Image Pipeline Design

  • Uri Gadot
  • Rinon Gal
  • Yftah Ziser
  • Gal Chechik
  • Shie Mannor

Text-to-image generation has evolved beyond single monolithic models to complex multi-component pipelines that combine various enhancement tools. While these pipelines significantly improve image quality, their effective design requires substantial expertise. Recent approaches automating this process through large language models (LLMs) have shown promise but suffer from two critical limitations: extensive computational requirements from generating images with hundreds of predefined pipelines, and poor generalization beyond memorized training examples. We introduce a novel reinforcement learning-based framework that addresses these inefficiencies. Our approach first trains an ensemble of reward models capable of predicting image quality scores directly from prompt-workflow combinations, eliminating the need for costly image generation during training. We then implement a two-phase training strategy: initial workflow prediction training followed by GRPO-based optimization that guides the model toward higher-performing regions of the workflow space. Additionally, we incorporate a classifier-free guidance based enhancement technique that extrapolates along the path between the initial and GRPO-tuned models, further improving output quality. We validate our approach through a set of comparisons, showing that it can successfully create new flows with greater diversity and lead to superior image quality compared to existing baselines.

ICML Conference 2025 Conference Paper

Reinforcement Learning with Segment Feedback

  • Yihan Du
  • Anna Winnicki
  • Gal Dalal
  • Shie Mannor
  • R. Srikant 0001

Standard reinforcement learning (RL) assumes that an agent can observe a reward for each state-action pair. However, in practical applications, it is often difficult and costly to collect a reward for each state-action pair. While there have been several works considering RL with trajectory feedback, it is unclear if trajectory feedback is inefficient for learning when trajectories are long. In this work, we consider a model named RL with segment feedback, which offers a general paradigm filling the gap between per-state-action feedback and trajectory feedback. In this model, we consider an episodic Markov decision process (MDP), where each episode is divided into $m$ segments, and the agent observes reward feedback only at the end of each segment. Under this model, we study two popular feedback settings: binary feedback and sum feedback, where the agent observes a binary outcome and a reward sum according to the underlying reward function, respectively. To investigate the impact of the number of segments $m$ on learning performance, we design efficient algorithms and establish regret upper and lower bounds for both feedback settings. Our theoretical and experimental results show that: under binary feedback, increasing the number of segments $m$ decreases the regret at an exponential rate; in contrast, surprisingly, under sum feedback, increasing $m$ does not reduce the regret significantly.
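
A small sketch of the observation model described above (the sigmoid link for binary feedback is an assumption for illustration): an episode's hidden per-step rewards are collapsed into $m$ per-segment observations.

```python
# Segment feedback: the agent sees only per-segment sums, or binary
# outcomes drawn from a sigmoid of those sums, never per-step rewards.
import numpy as np

rng = np.random.default_rng(0)

def segment_feedback(rewards, m, kind="sum"):
    segments = np.array_split(np.asarray(rewards), m)
    sums = np.array([seg.sum() for seg in segments])
    if kind == "sum":
        return sums                                       # sum feedback
    return rng.binomial(1, 1.0 / (1.0 + np.exp(-sums)))   # binary feedback

hidden_rewards = rng.normal(0.1, 1.0, size=60)            # unobserved by agent
print(segment_feedback(hidden_rewards, m=4, kind="sum"))
print(segment_feedback(hidden_rewards, m=4, kind="binary"))
```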

NeurIPS Conference 2025 Conference Paper

State Entropy Regularization for Robust Reinforcement Learning

  • Yonatan Ashlag
  • Uri Koren
  • Mirco Mutti
  • Esther Derman
  • Pierre-Luc Bacon
  • Shie Mannor

State entropy regularization has empirically shown better exploration and sample complexity in reinforcement learning (RL). However, its theoretical guarantees have not been studied. In this paper, we show that state entropy regularization improves robustness to structured and spatially correlated perturbations. These types of variation are common in transfer learning but often overlooked by standard robust RL methods, which typically focus on small, uncorrelated changes. We provide a comprehensive characterization of these robustness properties, including formal guarantees under reward and transition uncertainty, as well as settings where the method performs poorly. Much of our analysis contrasts state entropy with the widely used policy entropy regularization, highlighting their different benefits. Finally, from a practical standpoint, we illustrate that compared with policy entropy, the robustness advantages of state entropy are more sensitive to the number of rollouts used for policy evaluation.

ICML Conference 2024 Conference Paper

Bring Your Own (Non-Robust) Algorithm to Solve Robust MDPs by Estimating The Worst Kernel

  • Uri Gadot
  • Kaixin Wang
  • Navdeep Kumar
  • Kfir Yehuda Levy
  • Shie Mannor

Robust Markov Decision Processes (RMDPs) provide a framework for sequential decision-making that is robust to perturbations on the transition kernel. However, current RMDP methods are often limited to small-scale problems, hindering their use in high-dimensional domains. To bridge this gap, we present EWoK, a novel online approach to solving RMDPs that Estimates the Worst transition Kernel to learn robust policies. Unlike previous works that regularize the policy or value updates, EWoK achieves robustness by simulating the worst scenarios for the agent while retaining complete flexibility in the learning process. Notably, EWoK can be applied on top of any off-the-shelf non-robust RL algorithm, enabling easy scaling to high-dimensional domains. Our experiments, spanning from simple Cartpole to high-dimensional DeepMind Control Suite environments, demonstrate the effectiveness and applicability of the EWoK paradigm as a practical method for learning robust policies.
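
A minimal sketch of the EWoK recipe (the exponential-tilting form below is an illustrative choice, not necessarily the paper's estimator): next-state sampling is tilted toward low-value outcomes within a perturbation budget, and the reshaped transitions are fed to any standard non-robust RL learner.

```python
# Simulating an approximate worst-case kernel for a non-robust learner.
import numpy as np

rng = np.random.default_rng(0)

def worst_kernel_step(P_row, V, beta=1.0):
    # Tilt the nominal next-state distribution P_row toward states with
    # low estimated value V; beta controls the perturbation size.
    w = P_row * np.exp(-beta * V)
    w /= w.sum()
    return rng.choice(len(P_row), p=w)

P_row = np.array([0.6, 0.3, 0.1])   # nominal transition row
V = np.array([1.0, 0.2, -0.5])      # current value estimates
print(worst_kernel_step(P_row, V))  # transition used to train the base agent
```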

ICML Conference 2024 Conference Paper

Efficient Value Iteration for s-rectangular Robust Markov Decision Processes

  • Navdeep Kumar
  • Kaixin Wang
  • Kfir Yehuda Levy
  • Shie Mannor

We focus on s-rectangular robust Markov decision processes (MDPs), which capture interconnected uncertainties across different actions within each state. This framework is more general compared to sa-rectangular robust MDPs, where uncertainties in each action are independent. However, the introduced interdependence significantly amplifies the complexity of the problem. Existing methods either have slow performance guarantees or are inapplicable to even moderately large state spaces. In this work, we derive optimal robust Bellman operators in explicit forms. This leads to robust value iteration methods with significantly faster time complexities than existing approaches, which can be used in large state spaces. Further, our findings reveal that the optimal policies demonstrate a novel threshold behavior, selectively favoring a limited set of actions based on their respective advantage functions. Additionally, our study uncovers a noteworthy connection between the robustness of a policy and the variance in its value function, highlighting that policies with lower variance exhibit greater resilience.

ICML Conference 2024 Conference Paper

Exploration-Driven Policy Optimization in RLHF: Theoretical Insights on Efficient Data Utilization

  • Yihan Du
  • Anna Winnicki
  • Gal Dalal
  • Shie Mannor
  • R. Srikant 0001

Reinforcement Learning from Human Feedback (RLHF) has achieved impressive empirical successes while relying on a small amount of human feedback. However, there is limited theoretical justification for this phenomenon. Additionally, most recent studies focus on value-based algorithms despite the recent empirical successes of policy-based algorithms. In this work, we consider an RLHF algorithm based on policy optimization (PO-RLHF). The algorithm is based on the popular Policy Cover-Policy Gradient (PC-PG) algorithm, which assumes knowledge of the reward function. In PO-RLHF, knowledge of the reward function is not assumed and the algorithm relies on trajectory-based comparison feedback to infer the reward function. We provide performance bounds for PO-RLHF with low query complexity, which provides insight into why a small amount of human feedback may be sufficient to get good performance with RLHF. A key novelty is our trajectory-level elliptical potential analysis technique used to infer reward function parameters when comparison queries rather than reward observations are used. We provide and analyze algorithms in two settings: linear and neural function approximation, PG-RLHF and NN-PG-RLHF, respectively.

ICML Conference 2024 Conference Paper

Improving Token-Based World Models with Parallel Observation Prediction

  • Lior Cohen
  • Kaixin Wang
  • Bingyi Kang
  • Shie Mannor

Motivated by the success of Transformers when applied to sequences of discrete symbols, token-based world models (TBWMs) were recently proposed as sample-efficient methods. In TBWMs, the world model consumes agent experience as a language-like sequence of tokens, where each observation constitutes a sub-sequence. However, during imagination, the sequential token-by-token generation of next observations results in a severe bottleneck, leading to long training times, poor GPU utilization, and limited representations. To resolve this bottleneck, we devise a novel Parallel Observation Prediction (POP) mechanism. POP augments a Retentive Network (RetNet) with a novel forward mode tailored to our reinforcement learning setting. We incorporate POP in a novel TBWM agent named REM (Retentive Environment Model), showcasing a 15.4x faster imagination compared to prior TBWMs. REM attains superhuman performance on 12 out of 26 games of the Atari 100K benchmark, while training in less than 12 hours. Our code is available at https://github.com/leor-c/REM

ICML Conference 2024 Conference Paper

Prospective Side Information for Latent MDPs

  • Jeongyeol Kwon
  • Yonathan Efroni
  • Shie Mannor
  • Constantine Caramanis

In many interactive decision-making problems, there is contextual side information that remains fixed within the course of an interaction. This problem has been studied quite extensively under the assumption that the context is fully observed, as well as in the opposing limit when the context is unobserved, a special type of POMDP also referred to as a Latent MDP (LMDP). In this work, we consider a class of decision problems that interpolates between these settings: between the case where the context is fully observed and the case where it is unobserved. We refer to this class of decision problems as LMDPs with prospective side information. In such an environment, an agent receives additional, weakly revealing, information on the latent context at the beginning of each episode. We show that, surprisingly, this problem is not captured by contemporary POMDP settings and is not solved by RL algorithms designed for partially observed environments. We then establish that any sample-efficient algorithm must suffer at least $\Omega(K^{2/3})$-regret, as opposed to standard $\Omega(\sqrt{K})$ lower bounds. We design an algorithm with a matching upper bound that depends only polynomially on the problem parameters. This establishes an exponential improvement in sample complexity, relative to the lower bound established in prior work for LMDPs without prospective information.

NeurIPS Conference 2024 Conference Paper

RL in Latent MDPs is Tractable: Online Guarantees via Off-Policy Evaluation

  • Jeongyeol Kwon
  • Shie Mannor
  • Constantine Caramanis
  • Yonathan Efroni

In many real-world decision problems there is partially observed, hidden or latent information that remains fixed throughout an interaction. Such decision problems can be modeled as Latent Markov Decision Processes (LMDPs), where a latent variable is selected at the beginning of an interaction and is not disclosed to the agent initially. In the last decade, there has been significant progress in designing learning algorithms for solving LMDPs under different structural assumptions. However, for general LMDPs, there is no known learning algorithm that provably matches the existing lower bound. We effectively resolve this open question, introducing the first sample-efficient algorithm for LMDPs without any additional structural assumptions. Our result builds on a new perspective on the role of off-policy evaluation guarantees and the coverage coefficient in LMDPs, a perspective that has been overlooked in the context of exploration in partially observed environments. Specifically, we establish a novel off-policy evaluation lemma and introduce a new coverage coefficient for LMDPs. Then, we show how these can be used to derive near-optimal guarantees for an optimistic exploration algorithm. These results, we believe, can be valuable for a wide range of interactive learning problems beyond the LMDP class, and especially for partially observed environments.

ICML Conference 2024 Conference Paper

Sobolev Space Regularised Pre Density Models

  • Mark Kozdoba
  • Binyamin Perets
  • Shie Mannor

We propose a new approach to non-parametric density estimation that is based on regularizing a Sobolev norm of the density. This method is statistically consistent, and makes the inductive bias of the model clear and interpretable. While there is no closed analytic form for the associated kernel, we show that one can approximate it using sampling. The optimization problem needed to determine the density is non-convex, and standard gradient methods do not perform well. However, we show that with an appropriate initialization and using natural gradients, one can obtain well-performing solutions. Finally, while the approach provides pre-densities (i.e., not necessarily integrating to 1), which prevents the use of log-likelihood for cross-validation, we show that one can instead adapt Fisher-divergence-based score matching methods for this task. We evaluate the resulting method on the comprehensive recent anomaly detection benchmark suite ADBench, and find that it ranks second best among more than 15 algorithms.

AAAI Conference 2024 Conference Paper

Solving Non-rectangular Reward-Robust MDPs via Frequency Regularization

  • Uri Gadot
  • Esther Derman
  • Navdeep Kumar
  • Maxence Mohamed Elfatihi
  • Kfir Levy
  • Shie Mannor

In robust Markov decision processes (RMDPs), it is assumed that the reward and the transition dynamics lie in a given uncertainty set. By targeting maximal return under the most adversarial model from that set, RMDPs address performance sensitivity to misspecified environments. Yet, to preserve computational tractability, the uncertainty set is traditionally independently structured for each state. This so-called rectangularity condition is solely motivated by computational concerns. As a result, it lacks a practical incentive and may lead to overly conservative behavior. In this work, we study coupled-reward RMDPs where the transition kernel is fixed, but the reward function lies within an $\alpha$-radius of a nominal one. We draw a direct connection between this type of non-rectangular reward-RMDP and applying policy visitation frequency regularization. We introduce a policy-gradient method and prove its convergence. Numerical experiments illustrate the learned policy's robustness and its less conservative behavior when compared to rectangular uncertainty sets.

ICLR Conference 2024 Conference Paper

Tree Search-Based Policy Optimization under Stochastic Execution Delay

  • David Valensi
  • Esther Derman
  • Shie Mannor
  • Gal Dalal

The standard formulation of Markov decision processes (MDPs) assumes that the agent's decisions are executed immediately. However, in numerous realistic applications such as robotics or healthcare, actions are performed with a delay whose value can even be stochastic. In this work, we introduce stochastic delayed execution MDPs, a new formalism addressing random delays without resorting to state augmentation. We show that given observed delay values, it is sufficient to perform a policy search in the class of Markov policies in order to reach optimal performance, thus extending the deterministic fixed delay case. Armed with this insight, we devise DEZ, a model-based algorithm that optimizes over the class of Markov policies. DEZ leverages Monte-Carlo tree search similar to its non-delayed variant EfficientZero to accurately infer future states from the action queue. Thus, it handles delayed execution while preserving the sample efficiency of EfficientZero. Through empirical analysis, we stress that none of the prior benchmarks consistently outperforms others across different delays. We demonstrate that our algorithm surpasses all benchmark methods in Atari games when dealing with constant or stochastic delays. The code is available at \url{https://github.com/davidva1/Delayed-EZ}.
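
A toy sketch of the future-state inference step that the action queue enables (the deterministic forward model and names are illustrative): roll the model forward through the pending actions to find the state the newly chosen action will actually affect.

```python
# Given the last observed state and the queue of committed but
# not-yet-executed actions, infer the effective decision state.
from collections import deque

def infer_effective_state(state, pending_actions, forward_model):
    for a in pending_actions:        # actions committed during the delay
        state = forward_model(state, a)
    return state

queue = deque([+1, +1, -1])          # e.g., a realized delay of 3 steps
effective = infer_effective_state(0, queue, forward_model=lambda s, a: s + a)
print(effective)                     # choose the next action for this state
```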

NeurIPS Conference 2023 Conference Paper

Individualized Dosing Dynamics via Neural Eigen Decomposition

  • Stav Belogolovsky
  • Ido Greenberg
  • Danny Eytan
  • Shie Mannor

Dosing models often use differential equations to model biological dynamics. Neural differential equations in particular can learn to predict the derivative of a process, which permits predictions at irregular points of time. However, this temporal flexibility often comes with a high sensitivity to noise, whereas medical problems often present high noise and limited data. Moreover, medical dosing models must generalize reliably over individual patients and changing treatment policies. To address these challenges, we introduce the Neural Eigen Stochastic Differential Equation algorithm (NESDE). NESDE provides individualized modeling (using a hypernetwork over patient-level parameters); generalization to new treatment policies (using decoupled control); tunable expressiveness according to the noise level (using piecewise linearity); and fast, continuous, closed-form prediction (using spectral representation). We demonstrate the robustness of NESDE in both synthetic and real medical problems, and use the learned dynamics to publish simulated medical gym environments.

ICML Conference 2023 Conference Paper

Learning Hidden Markov Models When the Locations of Missing Observations are Unknown

  • Binyamin Perets
  • Mark Kozdoba
  • Shie Mannor

The Hidden Markov Model (HMM) is one of the most widely used statistical models for sequential data analysis. One of the key reasons for this versatility is the ability of HMMs to deal with missing data. However, standard HMM learning algorithms rely crucially on the assumption that the positions of the missing observations within the observation sequence are known. In the natural sciences, where this assumption is often violated, special variants of HMM, commonly known as Silent-state HMMs (SHMMs), are used. Despite their widespread use, these algorithms strongly rely on specific structural assumptions about the underlying chain, such as acyclicity, thus limiting the applicability of these methods. Moreover, even in the acyclic case, it has been shown that these methods can lead to poor reconstruction. In this paper we consider the general problem of learning an HMM from data with unknown missing observation locations. We provide reconstruction algorithms that do not require any assumptions about the structure of the underlying chain, and can also be used with limited prior knowledge, unlike SHMMs. We evaluate and compare the algorithms in a variety of scenarios, measuring their reconstruction precision and robustness under model misspecification. Notably, we show that under proper specifications one can reconstruct the process dynamics as well as if the positions of the missing observations were known.
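
A toy sketch of the data model studied here (the transition and emission matrices are invented): an HMM emits a sequence, a random subset of observations is deleted, and the learner receives only the shortened sequence, without the locations of the gaps.

```python
# Generating HMM data with missing observations at unknown positions.
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0.9, 0.1], [0.2, 0.8]])     # transition matrix
B = np.array([[0.8, 0.2], [0.3, 0.7]])     # emission matrix
s, obs = 0, []
for _ in range(200):
    obs.append(rng.choice(2, p=B[s]))
    s = rng.choice(2, p=A[s])
keep = rng.random(len(obs)) > 0.3          # ~30% of positions dropped
observed = [o for o, k in zip(obs, keep) if k]   # gap locations unknown
print(len(observed), observed[:20])
```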

ICML Conference 2023 Conference Paper

Learning to Initiate and Reason in Event-Driven Cascading Processes

  • Yuval Atzmon
  • Eli A. Meirom
  • Shie Mannor
  • Gal Chechik

Training agents to control a dynamic environment is a fundamental task in AI. In many environments, the dynamics can be summarized by a small set of events that capture the semantic behavior of the system. Typically, these events form chains or cascades. We often wish to change the system behavior using a single intervention that propagates through the cascade. For instance, one may trigger a biochemical cascade to switch the state of a cell or, in logistics, reroute a truck to meet an unexpected, urgent delivery. We introduce a new supervised learning setup called Cascade. An agent observes a system with known dynamics evolving from some initial state. The agent is given a structured semantic instruction and needs to make an intervention that triggers a cascade of events, such that the system reaches an alternative (counterfactual) behavior. We provide a test-bed for this problem, consisting of physical objects. We combine semantic tree search with an event-driven forward model and devise an algorithm that learns to efficiently search in exponentially large semantic trees. We demonstrate that our approach learns to follow instructions to intervene in new complex scenes. When provided with an observed cascade of events, it can also reason about alternative outcomes.

NeurIPS Conference 2023 Conference Paper

Optimization or Architecture: How to Hack Kalman Filtering

  • Ido Greenberg
  • Netanel Yannay
  • Shie Mannor

In non-linear filtering, it is traditional to compare non-linear architectures such as neural networks to the standard linear Kalman Filter (KF). We observe that this mixes the evaluation of two separate components: the non-linear architecture, and the parameter optimization method. In particular, the non-linear model is often optimized, whereas the reference KF model is not. We argue that both should be optimized similarly, and to that end present the Optimized KF (OKF). We demonstrate that the KF may become competitive with neural models, if optimized using OKF. This implies that experimental conclusions of certain previous studies were derived from a flawed process. The advantage of OKF over the standard KF is further studied theoretically and empirically, in a variety of problems. Conveniently, OKF can replace the KF in real-world systems by merely updating the parameters.
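
A minimal sketch of the OKF point (a scalar random-walk model; the grid search is a crude stand-in for the paper's optimization): treat the filter's noise covariances as parameters fitted to data by minimizing prediction error, rather than hand-specified constants.

```python
# Fit a scalar Kalman filter's noise parameters to data.
import numpy as np

rng = np.random.default_rng(0)
# Simulated scalar random walk with observation noise.
x = np.cumsum(rng.normal(0, 0.3, 500))
y = x + rng.normal(0, 1.0, 500)

def kf_mse(q, r):
    m, p, err = 0.0, 1.0, []
    for obs in y:
        p += q                        # predict step (F = 1)
        k = p / (p + r)               # Kalman gain
        err.append((obs - m) ** 2)    # pre-update prediction error
        m += k * (obs - m)
        p *= (1 - k)
    return np.mean(err)

# "Optimize" the KF: a crude grid search over the noise parameters.
grid = [(q, r) for q in np.logspace(-3, 1, 20) for r in np.logspace(-2, 2, 20)]
q_best, r_best = min(grid, key=lambda qr: kf_mse(*qr))
print(q_best, r_best, kf_mse(q_best, r_best))
```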

AAAI Conference 2023 Conference Paper

Planning and Learning with Adaptive Lookahead

  • Aviv Rosenberg
  • Assaf Hallak
  • Shie Mannor
  • Gal Chechik
  • Gal Dalal

Some of the most powerful reinforcement learning frameworks use planning for action selection. Interestingly, their planning horizon is either fixed or determined arbitrarily by the state visitation history. Here, we expand beyond the naive fixed horizon and propose a theoretically justified strategy for adaptive selection of the planning horizon as a function of the state-dependent value estimate. We propose two variants for lookahead selection and analyze the trade-off between iteration count and computational complexity per iteration. We then devise a corresponding deep Q-network algorithm with an adaptive tree search horizon. We separate the value estimation per depth to compensate for the off-policy discrepancy between depths. Lastly, we demonstrate the efficacy of our adaptive lookahead method in a maze environment and Atari.

NeurIPS Conference 2023 Conference Paper

Policy Gradient for Rectangular Robust Markov Decision Processes

  • Navdeep Kumar
  • Esther Derman
  • Matthieu Geist
  • Kfir Y. Levy
  • Shie Mannor

Policy gradient methods have become a standard for training reinforcement learning agents in a scalable and efficient manner. However, they do not account for transition uncertainty, whereas learning robust policies can be computationally expensive. In this paper, we introduce robust policy gradient (RPG), a policy-based method that efficiently solves rectangular robust Markov decision processes (MDPs). We provide a closed-form expression for the worst occupation measure. Incidentally, we find that the worst kernel is a rank-one perturbation of the nominal. Combining the worst occupation measure with a robust Q-value estimation yields an explicit form of the robust gradient. Our resulting RPG can be estimated from data with the same time complexity as its non-robust equivalent. Hence, it relieves the computational burden of convex optimization problems required for training robust policies by current policy gradient approaches.

ICML Conference 2023 Conference Paper

PPG Reloaded: An Empirical Study on What Matters in Phasic Policy Gradient

  • Kaixin Wang
  • Daquan Zhou
  • Jiashi Feng
  • Shie Mannor

In model-free reinforcement learning, recent methods based on a phasic policy gradient (PPG) framework have shown impressive improvements in sample efficiency and zero-shot generalization on the challenging Procgen benchmark. In PPG, two design choices are believed to be the key contributing factors to its superior performance over PPO: the high level of value sample reuse and the low frequency of feature distillation. However, through an extensive empirical study, we unveil that policy regularization and data diversity are what actually matters. In particular, we can achieve the same level of performance with low value sample reuse and frequent feature distillation, as long as the policy regularization strength and data diversity are preserved. In addition, we can maintain the high performance of PPG while reducing the computational cost to a similar level as PPO. Our comprehensive study covers all 16 Procgen games in both sample efficiency and generalization setups. We hope it can advance the understanding of PPG and provide insights for future works.

EWRL Workshop 2023 Workshop Paper

Representation-Driven Reinforcement Learning

  • Ofir Nabati
  • Guy Tennenholtz
  • Shie Mannor

We present a representation-driven framework for reinforcement learning. By representing policies as estimates of their expected values, we leverage techniques from contextual bandits to guide exploration and exploitation. Particularly, embedding a policy network into a linear feature space allows us to reframe the exploration-exploitation problem as a representation-exploitation problem, where good policy representations enable optimal exploration. We demonstrate the effectiveness of this framework through its application to evolutionary and policy gradient-based approaches, leading to significantly improved performance compared to traditional methods. Our framework provides a new perspective on reinforcement learning, highlighting the importance of policy representation in determining optimal exploration-exploitation strategies.

ICML Conference 2023 Conference Paper

Representation-Driven Reinforcement Learning

  • Ofir Nabati
  • Guy Tennenholtz
  • Shie Mannor

We present a representation-driven framework for reinforcement learning. By representing policies as estimates of their expected values, we leverage techniques from contextual bandits to guide exploration and exploitation. Particularly, embedding a policy network into a linear feature space allows us to reframe the exploration-exploitation problem as a representation-exploitation problem, where good policy representations enable optimal exploration. We demonstrate the effectiveness of this framework through its application to evolutionary and policy gradient-based approaches, leading to significantly improved performance compared to traditional methods. Our framework provides a new perspective on reinforcement learning, highlighting the importance of policy representation in determining optimal exploration-exploitation strategies.

ICML Conference 2023 Conference Paper

Reward-Mixing MDPs with Few Latent Contexts are Learnable

  • Jeongyeol Kwon
  • Yonathan Efroni
  • Constantine Caramanis
  • Shie Mannor

We consider episodic reinforcement learning in reward-mixing Markov decision processes (RMMDPs): at the beginning of every episode nature randomly picks a latent reward model among $M$ candidates, and an agent interacts with the MDP throughout the episode for $H$ time steps. Our goal is to learn a near-optimal policy that nearly maximizes the $H$ time-step cumulative rewards in such a model. Prior work established an upper bound for RMMDPs with $M=2$. In this work, we resolve several open questions for the general RMMDP setting. We consider an arbitrary $M \ge 2$ and provide a sample-efficient algorithm, $EM^2$, that outputs an $\epsilon$-optimal policy using $O\left(\epsilon^{-2} \cdot S^d A^d \cdot \text{poly}(H, Z)^d\right)$ episodes, where $S, A$ are the number of states and actions respectively, $H$ is the time-horizon, $Z$ is the support size of reward distributions, and $d=O(\min(M, H))$. We also provide a $(SA)^{\Omega(\sqrt{M})} / \epsilon^{2}$ lower bound, supporting that super-polynomial sample complexity in $M$ is necessary.

EWRL Workshop 2023 Workshop Paper

Robust Reinforcement Learning via Adversarial Kernel Approximation

  • Kaixin Wang
  • Uri Gadot
  • Navdeep Kumar
  • Kfir Yehuda Levy
  • Shie Mannor

Robust Markov Decision Processes (RMDPs) provide a framework for sequential decision-making that is robust to perturbations on the transition kernel. However, robust reinforcement learning (RL) approaches in RMDPs do not scale well to realistic online settings with high-dimensional domains. By characterizing the adversarial kernel in RMDPs, we propose a novel approach for online robust RL that approximates the adversarial kernel and uses a standard (non-robust) RL algorithm to learn a robust policy. Notably, our approach can be applied on top of any underlying RL algorithm, enabling easy scaling to high-dimensional domains. Experiments in classic control tasks, MinAtar and DeepMind Control Suite demonstrate the effectiveness and the applicability of our method.

EWRL Workshop 2023 Workshop Paper

Towards Faster Global Convergence of Robust Policy Gradient Methods

  • Navdeep Kumar
  • Ilnura Usmanova
  • Kfir Yehuda Levy
  • Shie Mannor

Recently, global convergence has been achieved for non-robust MDPs with an iteration complexity of $O(\frac{1}{\epsilon})$ for finding an $\epsilon$-optimal policy, for which the PL condition derived from the performance difference lemma has played a key role. This work extends the performance difference lemma to \texttt{s}-rectangular robust MDPs, from which a PL condition can be derived. We further present a simplified proof of policy gradient convergence for the non-robust case, which, together with the robust performance difference lemma, leads to global convergence of robust policy gradient.

EWRL Workshop 2023 Workshop Paper

Train Hard, Fight Easy: Robust Meta Reinforcement Learning

  • Ido Greenberg
  • Shie Mannor
  • Gal Chechik
  • Eli Meirom

A major challenge of reinforcement learning (RL) in real-world applications is the variation between environments, tasks or clients. Meta-RL (MRL) addresses this issue by learning a meta-policy that adapts to new tasks. Standard MRL methods optimize the average return over tasks, but often suffer from poor results in tasks of high risk or difficulty. This limits system reliability whenever test tasks are not known in advance. In this work, we define a robust MRL objective with a controlled robustness level. Disturbingly, optimization of analogous robust objectives in RL is known to lead to both *biased gradients* and *data inefficiency*. The gradient bias is proven to disappear in MRL, which further motivates the proposed framework. The data inefficiency is addressed via the novel Robust Meta RL algorithm (RoML). RoML is a meta-algorithm that generates a robust version of any given MRL algorithm, by identifying and over-sampling harder tasks throughout training. We demonstrate that RoML achieves robust returns on multiple navigation and continuous control benchmarks.
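
A hedged sketch of the over-sampling mechanism (the exponential weighting is an illustrative choice, not RoML's exact rule): tasks with lower recent returns are drawn more often during meta-training.

```python
# Robust task sampler: tilt task selection toward harder tasks.
import numpy as np

rng = np.random.default_rng(0)

def robust_task_sampler(task_returns, alpha=1.0):
    # Lower recent return -> higher sampling weight; alpha tunes the tilt.
    r = np.asarray(task_returns, dtype=float)
    w = np.exp(-alpha * (r - r.min()))
    return rng.choice(len(r), p=w / w.sum())

recent_returns = [10.0, 2.0, 7.5, -1.0]     # running estimates per task
print(robust_task_sampler(recent_returns))  # hard tasks dominate the draw
```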

NeurIPS Conference 2023 Conference Paper

Train Hard, Fight Easy: Robust Meta Reinforcement Learning

  • Ido Greenberg
  • Shie Mannor
  • Gal Chechik
  • Eli Meirom

A major challenge of reinforcement learning (RL) in real-world applications is the variation between environments, tasks or clients. Meta-RL (MRL) addresses this issue by learning a meta-policy that adapts to new tasks. Standard MRL methods optimize the average return over tasks, but often suffer from poor results in tasks of high risk or difficulty. This limits system reliability since test tasks are not known in advance. In this work, we define a robust MRL objective with a controlled robustness level. Optimization of analogous robust objectives in RL is known to lead to both biased gradients and data inefficiency. We prove that the gradient bias disappears in our proposed MRL framework. The data inefficiency is addressed via the novel Robust Meta RL algorithm (RoML). RoML is a meta-algorithm that generates a robust version of any given MRL algorithm, by identifying and over-sampling harder tasks throughout training. We demonstrate that RoML achieves robust returns on multiple navigation and continuous control benchmarks.

EWRL Workshop 2022 Workshop Paper

$Q$-Learning for $L_p$ Robust Markov Decision Processes

  • Navdeep Kumar
  • Kaixin Wang
  • Kfir Levy
  • Shie Mannor

Robust Markov Decision Processes (MDPs) are a powerful tool for solving sequential decision-making problems where system parameters are partially known, changing, or adversarial. Recently, there have been works aimed at solving \texttt{sa}- and \texttt{s}-rectangular robust MDPs. These methods are model-based and can potentially be generalized to model-free settings. We formally propose model-free algorithms for \texttt{sa}- and \texttt{s}-rectangular $L_p$ robust MDPs and provide convergence guarantees. The proposed model-free algorithms can be combined with existing deep RL techniques such as DQN to solve challenging problems.
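
A hedged sketch of what such a model-free update can look like (the value-spread penalty below is a simplified \texttt{sa}-rectangular stand-in, not the paper's exact operator): the uncertainty ball around the nominal kernel surfaces as a penalty subtracted from the Bellman target.

```python
# Illustrative robust Q-learning update with a spread penalty.
import numpy as np

def robust_q_update(Q, s, a, r, s2, beta, gamma=0.99, lr=0.1):
    V = Q.max(axis=1)                   # greedy value estimates
    spread = 0.5 * (V.max() - V.min())  # half-span of V: the penalty term
    target = r + gamma * (V[s2] - beta * spread)
    Q[s, a] += lr * (target - Q[s, a])
    return Q

Q = np.zeros((3, 2))
Q = robust_q_update(Q, s=0, a=1, r=1.0, s2=2, beta=0.1)
print(Q)
```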

ICML Conference 2022 Conference Paper

Actor-Critic based Improper Reinforcement Learning

  • Mohammadi Zaki
  • Avi Mohan
  • Aditya Gopalan
  • Shie Mannor

We consider an improper reinforcement learning setting where a learner is given $M$ base controllers for an unknown Markov decision process, and wishes to combine them optimally to produce a potentially new controller that can outperform each of the base ones. This can be useful in tuning across controllers, learnt possibly in mismatched or simulated environments, to obtain a good controller for a given target environment with relatively few trials. Towards this, we propose two algorithms: (1) a Policy Gradient-based approach; and (2) an algorithm that can switch between a simple Actor-Critic (AC) based scheme and a Natural Actor-Critic (NAC) scheme depending on the available information. Both algorithms operate over a class of improper mixtures of the given controllers. For the first case, we derive convergence rate guarantees assuming access to a gradient oracle. For the AC-based approach we provide convergence rate guarantees to a stationary point in the basic AC case and to a global optimum in the NAC case. Numerical results on (i) the standard control theoretic benchmark of stabilizing an inverted pendulum; and (ii) a constrained queueing task show that our improper policy optimization algorithm can stabilize the system even when the base policies at its disposal are unstable.

ICML Conference 2022 Conference Paper

Analysis of Stochastic Processes through Replay Buffers

  • Shirli Di-Castro Shashua
  • Shie Mannor
  • Dotan Di Castro

Replay buffers are a key component in many reinforcement learning schemes. Yet, their theoretical properties are not fully understood. In this paper we analyze a system where a stochastic process $X$ is pushed into a replay buffer and then randomly sampled to generate a stochastic process $Y$ from the replay buffer. We provide an analysis of the properties of the sampled process, such as stationarity, Markovity and autocorrelation, in terms of the properties of the original process. Our theoretical analysis sheds light on why a replay buffer may be a good de-correlator. Our analysis provides theoretical tools for proving the convergence of replay-buffer-based algorithms, which are prevalent in reinforcement learning schemes.
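
A toy rendering of the analyzed system (the AR(1) source and buffer capacity are assumptions): pushing a correlated process $X$ through a FIFO buffer and sampling uniformly yields a far less correlated process $Y$.

```python
# Measuring the de-correlation effect of a replay buffer.
import numpy as np

rng = np.random.default_rng(0)
x = np.zeros(10_000)
for t in range(1, len(x)):                 # correlated source process X
    x[t] = 0.95 * x[t - 1] + rng.normal()

buf, y = [], []
for xt in x:
    buf.append(xt)
    if len(buf) > 500:                     # FIFO buffer of capacity 500
        buf.pop(0)
    y.append(buf[rng.integers(len(buf))])  # sampled process Y

ac = lambda z: np.corrcoef(z[:-1], z[1:])[0, 1]   # lag-1 autocorrelation
print("lag-1 autocorr X:", ac(x), " Y:", ac(np.array(y)))
```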

ICML Conference 2022 Conference Paper

Coordinated Attacks against Contextual Bandits: Fundamental Limits and Defense Mechanisms

  • Jeongyeol Kwon
  • Yonathan Efroni
  • Constantine Caramanis
  • Shie Mannor

Motivated by online recommendation systems, we propose the problem of finding the optimal policy in multitask contextual bandits when a small fraction $\alpha$ of the users are arbitrary and adversarial, while the remaining good users share the same instance of contextual bandits with $S$ contexts and $A$ actions. Whether a user is good or adversarial is not known in advance, and the goal is to robustly learn, with as few interactions as possible, a policy that maximizes rewards for the good users. We show a lower bound of $\tilde{\Omega}(\min(S, A)\cdot \alpha/\epsilon^2)$ per-user interactions to learn an $\epsilon$-optimal policy for the good users. We then show we can achieve an $\tilde{O}(\min(S, A)\cdot \alpha/\epsilon^2)$ upper-bound, by employing efficient robust mean estimators for both uni-variate and high-dimensional random variables. We also show that this can be improved depending on the distributions of contexts.

EWRL Workshop 2022 Workshop Paper

Cross-Entropy Soft-Risk Reinforcement Learning

  • Ido Greenberg
  • Yinlam Chow
  • Mohammad Ghavamzadeh
  • Shie Mannor

In risk-averse reinforcement learning (RL), the goal is to optimize some risk measure of the returns. A risk measure often focuses on the worst returns out of the agent’s experience. As a result, standard methods for risk-averse RL often ignore high-return strategies. We prove that under certain conditions this inevitably leads to a local-optimum barrier, and propose a mechanism we call soft risk to bypass it. We also devise a novel cross-entropy module for sampling, which (1) preserves risk aversion despite the soft risk; and (2) independently improves sample efficiency. By separating the risk aversion of the sampler and the optimizer, we can sample episodes with poor conditions, yet optimize with respect to successful strategies. We combine these two concepts in CeSoR, the Cross-entropy Soft-Risk optimization algorithm, which can be applied on top of any risk-averse policy gradient (PG) method. We demonstrate improved risk aversion in maze navigation, autonomous driving, and resource allocation benchmarks, including in scenarios where standard risk-averse PG completely fails. Our experiments are available on GitHub, and the cross-entropy module on PyPI.
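
A small sketch of the soft-risk mechanism (the annealing schedule and weighting are assumptions for illustration): optimize a CVaR-style objective over the worst-$\alpha$ fraction of returns, with $\alpha$ annealed from 1 (risk-neutral) down to the target risk level to avoid the local-optimum barrier discussed above.

```python
# Soft risk: annealed CVaR-style weights for the policy gradient update.
import numpy as np

def soft_risk_weights(returns, step, total_steps, alpha_target=0.1):
    # Anneal alpha from 1 down to alpha_target over the first half of training.
    alpha = 1.0 + (alpha_target - 1.0) * min(1.0, step / (0.5 * total_steps))
    r = np.asarray(returns, dtype=float)
    cutoff = np.quantile(r, alpha)            # keep the worst alpha-fraction
    w = (r <= cutoff).astype(float)
    return w / w.sum()                        # weights for the PG update

print(soft_risk_weights([5, -2, 3, 0, -7, 1], step=900, total_steps=1000))
```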

NeurIPS Conference 2022 Conference Paper

Efficient Risk-Averse Reinforcement Learning

  • Ido Greenberg
  • Yinlam Chow
  • Mohammad Ghavamzadeh
  • Shie Mannor

In risk-averse reinforcement learning (RL), the goal is to optimize some risk measure of the returns. A risk measure often focuses on the worst returns out of the agent's experience. As a result, standard methods for risk-averse RL often ignore high-return strategies. We prove that under certain conditions this inevitably leads to a local-optimum barrier, and propose a mechanism we call soft risk to bypass it. We also devise a novel cross-entropy module for sampling, which (1) preserves risk aversion despite the soft risk; and (2) independently improves sample efficiency. By separating the risk aversion of the sampler and the optimizer, we can sample episodes with poor conditions, yet optimize with respect to successful strategies. We combine these two concepts in CeSoR, the Cross-entropy Soft-Risk optimization algorithm, which can be applied on top of any risk-averse policy gradient (PG) method. We demonstrate improved risk aversion in maze navigation, autonomous driving, and resource allocation benchmarks, including in scenarios where standard risk-averse PG completely fails.

NeurIPS Conference 2022 Conference Paper

Finite Sample Analysis Of Dynamic Regression Parameter Learning

  • Mark Kozdoba
  • Edward Moroshko
  • Shie Mannor
  • Yacov Crammer

We consider the dynamic linear regression problem, where the predictor vector may vary with time. This problem can be modeled as a linear dynamical system, with a non-constant observation operator, where the parameters that need to be learned are the variances of both the process noise and the observation noise. While variance estimation for dynamic regression is a natural problem with a variety of applications, existing approaches either lack guarantees altogether or have only asymptotic guarantees without explicit rates. In particular, the existing literature does not provide any clues to the following fundamental question: in terms of data characteristics, what does the convergence rate depend on? In this paper we study the global system operator, i.e., the operator that maps the noise vectors to the output. We obtain estimates on its spectrum, and as a result derive the first known variance estimators with finite sample complexity guarantees. The proposed bounds depend on the shape of a certain spectrum related to the system operator, and thus provide the first known explicit geometric parameter of the data that can be used to bound estimation errors. In addition, the results hold for arbitrary sub-Gaussian distributions of the noise terms. We evaluate the approach on synthetic and real-world benchmarks.

AAAI Conference 2022 Conference Paper

Locality Matters: A Scalable Value Decomposition Approach for Cooperative Multi-Agent Reinforcement Learning

  • Roy Zohar
  • Shie Mannor
  • Guy Tennenholtz

Cooperative multi-agent reinforcement learning (MARL) faces significant scalability issues due to state and action spaces that are exponentially large in the number of agents. As environments grow in size, effective credit assignment becomes increasingly harder and often results in infeasible learning times. Still, in many real-world settings, there exist simplified underlying dynamics that can be leveraged for more scalable solutions. In this work, we exploit such locality structures effectively whilst maintaining global cooperation. We propose a novel, value-based multi-agent algorithm called LOMAQ, which incorporates local rewards in the Centralized Training Decentralized Execution paradigm. Additionally, we provide a direct reward decomposition method for finding these local rewards when only a global signal is provided. We test our method empirically, showing it scales well compared to other methods, significantly improving performance and convergence speed.

ICLR Conference 2022 Conference Paper

On Covariate Shift of Latent Confounders in Imitation and Reinforcement Learning

  • Guy Tennenholtz
  • Assaf Hallak
  • Gal Dalal
  • Shie Mannor
  • Gal Chechik
  • Uri Shalit

We consider the problem of using expert data with unobserved confounders for imitation and reinforcement learning. We begin by defining the problem of learning from confounded expert data in a contextual MDP setup. We analyze the limitations of learning from such data with and without external reward and propose an adjustment of standard imitation learning algorithms to fit this setup. In addition, we discuss the problem of distribution shift between the expert data and the online environment when partial observability is present in the data. We prove possibility and impossibility results for imitation learning under arbitrary distribution shift of the missing covariates. When additional external reward is provided, we propose a sampling procedure that addresses the unknown shift and prove convergence to an optimal solution. Finally, we validate our claims empirically on challenging assistive healthcare and recommender system simulation tasks.

AAAI Conference 2022 Conference Paper

Online Apprenticeship Learning

  • Lior Shani
  • Tom Zahavy
  • Shie Mannor

In Apprenticeship Learning (AL), we are given a Markov Decision Process (MDP) without access to the cost function. Instead, we observe trajectories sampled by an expert that acts according to some policy. The goal is to find a policy that matches the expert's performance on some predefined set of cost functions. We introduce an online variant of AL (Online Apprenticeship Learning; OAL), where the agent is expected to perform comparably to the expert while interacting with the environment. We show that the OAL problem can be effectively solved by combining two mirror-descent-based no-regret algorithms: one for policy optimization and another for learning the worst-case cost. By employing optimistic exploration, we derive a convergent algorithm with $O(\sqrt{K})$ regret, where $K$ is the number of interactions with the MDP, and an additional linear error term that depends on the number of expert trajectories available. Importantly, our algorithm avoids the need to solve an MDP at each iteration, making it more practical compared to prior AL methods. Finally, we implement a deep variant of our algorithm which shares some similarities with GAIL, but where the discriminator is replaced with the costs learned by OAL. Our simulations suggest that OAL performs well in high-dimensional control problems.

ICML Conference 2022 Conference Paper

Optimizing Tensor Network Contraction Using Reinforcement Learning

  • Eli A. Meirom
  • Haggai Maron
  • Shie Mannor
  • Gal Chechik

Quantum Computing (QC) stands to revolutionize computing, but is currently still limited. To develop and test quantum algorithms today, quantum circuits are often simulated on classical computers. Simulating a complex quantum circuit requires computing the contraction of a large network of tensors. The order (path) of contraction can have a drastic effect on the computing cost, but finding an efficient order is a challenging combinatorial optimization problem. We propose a Reinforcement Learning (RL) approach combined with Graph Neural Networks (GNN) to address the contraction ordering problem. The problem is extremely challenging due to the huge search space, the heavy-tailed reward distribution, and the challenging credit assignment. We show how a carefully implemented RL-agent that uses a GNN as the basic policy construct can address these challenges and obtain significant improvements over state-of-the-art techniques in three varieties of circuits, including the largest scale networks used in contemporary QC.
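
To make the cost sensitivity concrete, here is a toy matrix-chain count (an illustration of the underlying combinatorial problem, not the paper's RL+GNN method; the dimensions are hypothetical):

```python
# Toy illustration of the contraction-ordering problem: for a chain of three
# matrices, the two possible orders can differ in cost by orders of magnitude.
def matmul_cost(n: int, k: int, m: int) -> int:
    """Scalar multiplications for an (n x k) @ (k x m) product."""
    return n * k * m

# A is 10x1000, B is 1000x10, C is 10x1000.
cost_left = matmul_cost(10, 1000, 10) + matmul_cost(10, 10, 1000)       # (AB)C
cost_right = matmul_cost(1000, 10, 1000) + matmul_cost(10, 1000, 1000)  # A(BC)
print(cost_left, cost_right)  # 200,000 vs. 20,000,000 multiplications
```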

NeurIPS Conference 2022 Conference Paper

Reinforcement Learning with a Terminator

  • Guy Tennenholtz
  • Nadav Merlis
  • Lior Shani
  • Shie Mannor
  • Uri Shalit
  • Gal Chechik
  • Assaf Hallak
  • Gal Dalal

We present the problem of reinforcement learning with exogenous termination. We define the Termination Markov Decision Process (TerMDP), an extension of the MDP framework, in which episodes may be interrupted by an external non-Markovian observer. This formulation accounts for numerous real-world situations, such as a human interrupting an autonomous driving agent for reasons of discomfort. We learn the parameters of the TerMDP and leverage the structure of the estimation problem to provide state-wise confidence bounds. We use these to construct a provably-efficient algorithm, which accounts for termination, and bound its regret. Motivated by our theoretical analysis, we design and implement a scalable approach, which combines optimism (w.r.t. termination) and a dynamic discount factor, incorporating the termination probability. We deploy our method on high-dimensional driving and MinAtar benchmarks. Additionally, we test our approach on human data in a driving setting. Our results demonstrate fast convergence and significant improvement over various baseline approaches.
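
A minimal sketch of the dynamic-discount idea described above, assuming a per-state termination-probability estimate is available (the names and interface are ours, not the authors'):

```python
# Sketch of the dynamic-discount idea from the abstract (assumed interface):
# the effective discount scales the base discount by the estimated survival
# probability, so the backup down-weights futures that termination may cut off.
def effective_discount(gamma: float, p_terminate: float) -> float:
    """Base discount times the estimated probability the episode continues."""
    return gamma * (1.0 - p_terminate)

def terminated_backup(reward, next_value, gamma, p_terminate):
    """One-step value backup that accounts for exogenous termination."""
    return reward + effective_discount(gamma, p_terminate) * next_value

print(terminated_backup(1.0, 10.0, gamma=0.99, p_terminate=0.2))  # approx. 8.92
```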

EWRL Workshop 2022 Workshop Paper

Reinforcement Learning with a Terminator

  • Guy Tennenholtz
  • Nadav Merlis
  • Lior Shani
  • Shie Mannor
  • Uri Shalit
  • Gal Chechik
  • Assaf Hallak
  • Gal Dalal

We present the problem of reinforcement learning with exogenous termination. We define the Termination Markov Decision Process (TerMDP), an extension of the MDP framework, in which episodes may be interrupted by an external non-Markovian observer. We learn the parameters of the TerMDP and leverage the structure of the estimation problem to provide state-wise confidence bounds. We use these to construct a provably-efficient algorithm, which accounts for termination, and bound its regret. Motivated by our theoretical analysis, we design and implement a scalable approach, which combines optimism (w.r.t. termination) and a dynamic discount factor, incorporating the termination probability. We deploy our method on high-dimensional driving and MinAtar benchmarks. Additionally, we test our approach on human data in a driving setting. Our results demonstrate fast convergence and significant improvement over various baseline approaches.

ICML Conference 2022 Conference Paper

The Geometry of Robust Value Functions

  • Kaixin Wang
  • Navdeep Kumar
  • Kuangqi Zhou
  • Bryan Hooi
  • Jiashi Feng
  • Shie Mannor

The space of value functions is a fundamental concept in reinforcement learning. Characterizing its geometric properties may provide insights for optimization and representation. Existing works mainly focus on the value space for Markov Decision Processes (MDPs). In this paper, we study the geometry of the robust value space for the more general Robust MDPs (RMDPs) setting, where transition uncertainties are considered. Specifically, since we find it hard to directly adapt prior approaches to RMDPs, we start with revisiting the non-robust case, and introduce a new perspective that enables us to characterize both the non-robust and robust value space in a similar fashion. The key to this perspective is to decompose the value space, in a state-wise manner, into unions of hypersurfaces. Through our analysis, we show that the robust value space is determined by a set of conic hypersurfaces, each of which contains the robust values of all policies that agree on one state. Furthermore, we find that taking only extreme points in the uncertainty set is sufficient to determine the robust value space. Finally, we discuss some other aspects of the robust value space, including its non-convexity and policy agreement on multiple states.

NeurIPS Conference 2022 Conference Paper

Tractable Optimality in Episodic Latent MABs

  • Jeongyeol Kwon
  • Yonathan Efroni
  • Constantine Caramanis
  • Shie Mannor

We consider a multi-armed bandit problem with $M$ latent contexts, where an agent interacts with the environment for an episode of $H$ time steps. Depending on the length of the episode, the learner may not be able to accurately estimate the latent context. The resulting partial observation of the environment makes the learning task significantly more challenging. Without any additional structural assumptions, existing techniques to tackle partially observed settings imply the decision maker can learn a near-optimal policy with $O(A)^H$ episodes, but do not promise more. In this work, we show that learning with {\em polynomial} samples in $A$ is possible. We achieve this by using techniques from experiment design. Then, through a method-of-moments approach, we design a procedure that provably learns a near-optimal policy with $O(\mathrm{poly}(A) + \mathrm{poly}(M, H)^{\min(M, H)})$ interactions. In practice, we show that we can formulate the moment-matching via maximum likelihood estimation. In our experiments, this significantly outperforms the worst-case guarantees, as well as existing practical methods.

NeurIPS Conference 2022 Conference Paper

Uncertainty Estimation Using Riemannian Model Dynamics for Offline Reinforcement Learning

  • Guy Tennenholtz
  • Shie Mannor

Model-based offline reinforcement learning approaches generally rely on bounds of model error. Estimating these bounds is usually achieved through uncertainty estimation methods. In this work, we combine parametric and nonparametric methods for uncertainty estimation through a novel latent space based metric. In particular, we build upon recent advances in Riemannian geometry of generative models to construct a pullback metric of an encoder-decoder based forward model. Our proposed metric measures both the quality of out-of-distribution samples as well as the discrepancy of examples in the data. We leverage our combined method for uncertainty estimation in a pessimistic model-based framework, showing a significant improvement upon contemporary model-based offline approaches on continuous control and autonomous driving benchmarks.
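
For reference, the pullback construction the abstract builds on is the standard one from the Riemannian geometry of generative models (the notation is ours and may differ from the paper's):

```latex
% Pullback metric induced on the latent space Z by a decoder g : Z -> X
% (standard construction; notation ours). Latent distances are measured by
% how much the decoder stretches them:
G(z) = J_g(z)^\top J_g(z), \qquad J_g(z) = \frac{\partial g}{\partial z}(z)
```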

ICLR Conference 2021 Conference Paper

Acting in Delayed Environments with Non-Stationary Markov Policies

  • Esther Derman
  • Gal Dalal
  • Shie Mannor

The standard Markov Decision Process (MDP) formulation hinges on the assumption that an action is executed immediately after it was chosen. However, this assumption is often unrealistic and can lead to catastrophic failures in applications such as robotic manipulation, cloud computing, and finance. We introduce a framework for learning and planning in MDPs where the decision-maker commits actions that are executed with a delay of $m$ steps. The brute-force state augmentation baseline where the state is concatenated to the last $m$ committed actions suffers from an exponential complexity in $m$, as we show for policy iteration. We then prove that with execution delay, deterministic Markov policies in the original state-space are sufficient for attaining maximal reward, but need to be non-stationary. As for stationary Markov policies, we show they are sub-optimal in general. Consequently, we devise a non-stationary Q-learning style model-based algorithm that solves delayed execution tasks without resorting to state-augmentation. Experiments on tabular, physical, and Atari domains reveal that it converges quickly to high performance even for substantial delays, while standard approaches that either ignore the delay or rely on state-augmentation struggle or fail due to divergence. The code is available at \url{https://github.com/galdl/rl_delay_basic.git}.
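
The exponential cost of the state-augmentation baseline is easy to see numerically (the sizes below are illustrative, not the paper's experimental settings):

```python
# The brute-force baseline augments the state with the last m committed
# actions, so the augmented space has |S| * |A|**m states: exponential in m.
def augmented_state_count(num_states: int, num_actions: int, delay: int) -> int:
    return num_states * num_actions ** delay

for m in range(6):
    print(m, augmented_state_count(num_states=100, num_actions=10, delay=m))
# delay 0 -> 100 states; delay 5 -> 10,000,000 states.
```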

UAI Conference 2021 Conference Paper

Action redundancy in reinforcement learning

  • Nir Baram
  • Guy Tennenholtz
  • Shie Mannor

Maximum Entropy (MaxEnt) reinforcement learning is a powerful learning paradigm which seeks to maximize return under entropy regularization. However, action entropy does not necessarily coincide with state entropy, e.g., when multiple actions produce the same transition. Instead, we propose to maximize the transition entropy, i.e., the entropy of next states. We show that transition entropy can be described by two terms; namely, model-dependent transition entropy and action redundancy. Particularly, we explore the latter in both deterministic and stochastic settings and develop tractable approximation methods in a near model-free setup. We construct algorithms to minimize action redundancy and demonstrate their effectiveness on a synthetic environment with multiple redundant actions as well as contemporary benchmarks in Atari and Mujoco. Our results suggest that action redundancy is a fundamental problem in reinforcement learning.
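
A toy example of the gap between action entropy and transition entropy (an illustration of the motivation only, not the paper's approximation method):

```python
# With two redundant actions that induce the same next-state distribution,
# action entropy is maximal while transition entropy, the entropy of the
# next state, collapses to zero.
import numpy as np

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

# P[a, s']: next-state distribution for each of two actions at a fixed state.
P = np.array([[1.0, 0.0],
              [1.0, 0.0]])      # both actions lead to the same next state
pi = np.array([0.5, 0.5])       # uniform policy over the two actions

action_entropy = entropy(pi)         # log(2): looks maximally exploratory
transition_entropy = entropy(pi @ P) # 0: no state novelty at all
print(action_entropy, transition_entropy)
```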

UAI Conference 2021 Conference Paper

Bandits with partially observable confounded data

  • Guy Tennenholtz
  • Uri Shalit
  • Shie Mannor
  • Yonathan Efroni

We study linear contextual bandits with access to a large, confounded, offline dataset that was sampled from some fixed policy. We show that this problem is closely related to a variant of the bandit problem with side information. We construct a linear bandit algorithm that takes advantage of the projected information, and prove regret bounds. Our results demonstrate the ability to take advantage of confounded offline data. Particularly, we prove regret bounds that improve current bounds by a factor related to the visible dimensionality of the contexts in the data. Our results indicate that confounded offline data can significantly improve online learning algorithms. Finally, we demonstrate various characteristics of our approach through synthetic simulations.

ICML Conference 2021 Conference Paper

Confidence-Budget Matching for Sequential Budgeted Learning

  • Yonathan Efroni
  • Nadav Merlis
  • Aadirupa Saha
  • Shie Mannor

A core element in decision-making under uncertainty is the feedback on the quality of the performed actions. However, in many applications, such feedback is restricted. For example, in recommendation systems, repeatedly asking the user to provide feedback on the quality of recommendations will annoy them. In this work, we formalize decision-making problems with querying budget, where there is a (possibly time-dependent) hard limit on the number of reward queries allowed. Specifically, we focus on multi-armed bandits, linear contextual bandits, and reinforcement learning problems. We start by analyzing the performance of ‘greedy’ algorithms that query a reward whenever they can. We show that in fully stochastic settings, doing so performs surprisingly well, but in the presence of any adversity, this might lead to linear regret. To overcome this issue, we propose the Confidence-Budget Matching (CBM) principle that queries rewards when the confidence intervals are wider than the inverse square root of the available budget. We analyze the performance of CBM-based algorithms in different settings and show that they perform well in the presence of adversity in the contexts, initial states, and budgets.
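
A minimal single-arm sketch of the query rule as stated in the abstract, using a Hoeffding-style confidence width (the constants and interface are hypothetical, not the authors' code):

```python
# Query the reward only while the confidence width exceeds
# 1/sqrt(remaining budget), as described in the abstract.
import math

def should_query(num_observed: int, budget_left: int, horizon: int = 200) -> bool:
    if budget_left <= 0:
        return False
    # Hoeffding-style width for a [0, 1]-bounded reward after n observations.
    width = math.sqrt(2.0 * math.log(horizon) / max(num_observed, 1))
    return width > 1.0 / math.sqrt(budget_left)

budget, observed = 50, 0
for t in range(200):
    if should_query(observed, budget):
        observed += 1   # pay one query to observe this round's reward
        budget -= 1
print(f"reward queries used: {observed} / 200 rounds")
```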

ICML Conference 2021 Conference Paper

Controlling Graph Dynamics with Reinforcement Learning and Graph Neural Networks

  • Eli A. Meirom
  • Haggai Maron
  • Shie Mannor
  • Gal Chechik

We consider the problem of controlling a partially-observed dynamic process on a graph by a limited number of interventions. This problem naturally arises in contexts such as scheduling virus tests to curb an epidemic; targeted marketing in order to promote a product; and manually inspecting posts to detect fake news spreading on social networks. We formulate this setup as a sequential decision problem over a temporal graph process. In face of an exponential state space, combinatorial action space and partial observability, we design a novel tractable scheme to control dynamical processes on temporal graphs. We successfully apply our approach to two popular problems that fall into our framework: prioritizing which nodes should be tested in order to curb the spread of an epidemic, and influence maximization on a graph.

ICML Conference 2021 Conference Paper

Detecting Rewards Deterioration in Episodic Reinforcement Learning

  • Ido Greenberg
  • Shie Mannor

In many RL applications, once training ends, it is vital to detect any deterioration in the agent performance as soon as possible. Furthermore, it often has to be done without modifying the policy and under minimal assumptions regarding the environment. In this paper, we address this problem by focusing directly on the rewards and testing for degradation. We consider an episodic framework, where the rewards within each episode are not independent, nor identically-distributed, nor Markov. We present this problem as a multivariate mean-shift detection problem with possibly partial observations. We define the mean-shift in a way corresponding to deterioration of a temporal signal (such as the rewards), and derive a test for this problem with optimal statistical power. Empirically, on deteriorated rewards in control problems (generated using various environment modifications), the test is demonstrated to be more powerful than standard tests - often by orders of magnitude. We also suggest a novel Bootstrap mechanism for False Alarm Rate control (BFAR), applicable to episodic (non-i.i.d.) signals and allowing our test to run sequentially in an online manner. Our method does not rely on a learned model of the environment, is entirely external to the agent, and in fact can be applied to detect changes or drifts in any episodic signal.

NeurIPS Conference 2021 Conference Paper

Improve Agents without Retraining: Parallel Tree Search with Off-Policy Correction

  • Gal Dalal
  • Assaf Hallak
  • Steven Dalton
  • Iuri Frosio
  • Shie Mannor
  • Gal Chechik

Tree Search (TS) is crucial to some of the most influential successes in reinforcement learning. Here, we tackle two major challenges with TS that limit its usability: \textit{distribution shift} and \textit{scalability}. We first discover and analyze a counter-intuitive phenomenon: action selection through TS and a pre-trained value function often leads to lower performance compared to the original pre-trained agent, even when having access to the exact state and reward in future steps. We show this is due to a distribution shift to areas where value estimates are highly inaccurate and analyze this effect using Extreme Value theory. To overcome this problem, we introduce a novel off-policy correction term that accounts for the mismatch between the pre-trained value and its corresponding TS policy by penalizing under-sampled trajectories. We prove that our correction eliminates the above mismatch and bound the probability of sub-optimal action selection. Our correction significantly improves pre-trained Rainbow agents without any further training, often more than doubling their scores on Atari games. Next, we address the scalability issue given by the computational complexity of exhaustive TS that scales exponentially with the tree depth. We introduce Batch-BFS: a GPU breadth-first search that advances all nodes in each depth of the tree simultaneously. Batch-BFS reduces runtime by two orders of magnitude and, beyond inference, also enables training with TS at depths that were not feasible before. We train DQN agents from scratch using TS and show improvement in several Atari games compared to both the original DQN and the more advanced Rainbow. We will share the code upon publication.

UAI Conference 2021 Conference Paper

Known unknowns: Learning novel concepts using reasoning-by-elimination

  • Harsh Agrawal
  • Eli A. Meirom
  • Yuval Atzmon
  • Shie Mannor
  • Gal Chechik

People can learn new visual concepts without any samples, from information given by language or by deductive reasoning. For instance, people can use elimination to infer the meaning of novel labels from their context. While recognizing novel concepts was intensively studied in zero-shot learning with semantic descriptions, training models to learn by elimination is much less studied. Here we describe the first approach to train an agent to reason-by-elimination, by providing instructions that contain both familiar concepts and unfamiliar ones (“pick the red box and the green wambim”). In our framework, the agent combines a perception module with a reasoning module that includes internal memory. It uses reinforcement learning to construct a reasoning policy that, by considering all available items in a room, can make a correct inference even for never-seen objects or concepts. Furthermore, it can then perform one-shot learning and use newly learned concepts for inferring additional novel concepts. We evaluate this approach in a new set of environments, showing that agents successfully learn to reason by elimination, and can also learn novel concepts and use them for further reasoning. This approach paves the way to handle open-world environments by extending the abundant supervised learning approaches with reasoning frameworks that can handle novel concepts.

AAAI Conference 2021 Conference Paper

Lenient Regret for Multi-Armed Bandits

  • Nadav Merlis
  • Shie Mannor

We consider the Multi-Armed Bandit (MAB) problem, where an agent sequentially chooses actions and observes rewards for the actions it took. While the majority of algorithms try to minimize the regret, i.e., the cumulative difference between the reward of the best action and the agent’s action, this criterion might lead to undesirable results. For example, in large problems, or when the interaction with the environment is brief, finding an optimal arm is infeasible, and regret-minimizing algorithms tend to over-explore. To overcome this issue, algorithms for such settings should instead focus on playing near-optimal arms. To this end, we suggest a new, more lenient, regret criterion that ignores suboptimality gaps smaller than some $\epsilon$. We then present a variant of the Thompson Sampling (TS) algorithm, called $\epsilon$-TS, and prove its asymptotic optimality in terms of the lenient regret. Importantly, we show that when the mean of the optimal arm is high enough, the lenient regret of $\epsilon$-TS is bounded by a constant. Finally, we show that $\epsilon$-TS can be applied to improve the performance when the agent knows a lower bound of the suboptimality gaps.
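
One natural reading of the lenient criterion, as a toy computation (the paper's formal definition may differ in its details):

```python
# Lenient regret illustration: gaps below the leniency threshold eps
# contribute nothing, so near-optimal arms are "free" to play.
def lenient_regret(gaps, pulls, eps):
    """Sum pull counts times gaps, counting only gaps larger than eps."""
    return sum(n * g for g, n in zip(gaps, pulls) if g > eps)

gaps = [0.0, 0.02, 0.30]   # suboptimality gap of each arm
pulls = [500, 400, 100]    # how often each arm was played
print(lenient_regret(gaps, pulls, eps=0.05))  # 30.0: only the 0.30 arm counts
print(lenient_regret(gaps, pulls, eps=0.0))   # 38.0: standard regret
```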

ICML Conference 2021 Conference Paper

Online Limited Memory Neural-Linear Bandits with Likelihood Matching

  • Ofir Nabati
  • Tom Zahavy
  • Shie Mannor

We study neural-linear bandits for solving problems where {\em both} exploration and representation learning play an important role. Neural-linear bandits harness the representation power of Deep Neural Networks (DNNs) and combine it with efficient exploration mechanisms by leveraging uncertainty estimation of the model, designed for linear contextual bandits on top of the last hidden layer. In order to mitigate the problem of representation change during the process, new uncertainty estimations are computed using stored data from an unlimited buffer. Nevertheless, when the amount of stored data is limited, a phenomenon called catastrophic forgetting emerges. To alleviate this, we propose a likelihood matching algorithm that is resilient to catastrophic forgetting and is completely online. We applied our algorithm, Limited Memory Neural-Linear with Likelihood Matching (NeuralLinear-LiM2), on a variety of datasets and observed that our algorithm achieves comparable performance to the unlimited memory approach while exhibiting resilience to catastrophic forgetting.

ICLR Conference 2021 Conference Paper

Optimizing Memory Placement using Evolutionary Graph Reinforcement Learning

  • Shauharda Khadka
  • Estelle Aflalo
  • Mattias Marder
  • Avrech Ben-David
  • Santiago Miret
  • Shie Mannor
  • Tamir Hazan
  • Hanlin Tang

For deep neural network accelerators, memory movement is both energetically expensive and can bound computation. Therefore, optimal mapping of tensors to memory hierarchies is critical to performance. The growing complexity of neural networks calls for automated memory mapping instead of manual heuristic approaches; yet the search space of neural network computational graphs have previously been prohibitively large. We introduce Evolutionary Graph Reinforcement Learning (EGRL), a method designed for large search spaces, that combines graph neural networks, reinforcement learning, and evolutionary search. A set of fast, stateless policies guide the evolutionary search to improve its sample-efficiency. We train and validate our approach directly on the Intel NNP-I chip for inference. EGRL outperforms policy-gradient, evolutionary search and dynamic programming baselines on BERT, ResNet-101 and ResNet-50. We additionally achieve 28-78% speed-up compared to the native NNP-I compiler on all three workloads.

NeurIPS Conference 2021 Conference Paper

Reinforcement Learning in Reward-Mixing MDPs

  • Jeongyeol Kwon
  • Yonathan Efroni
  • Constantine Caramanis
  • Shie Mannor

Learning a near optimal policy in a partially observable system remains an elusive challenge in contemporary reinforcement learning. In this work, we consider episodic reinforcement learning in a reward-mixing Markov decision process (MDP). There, a reward function is drawn from one of $M$ possible reward models at the beginning of every episode, but the identity of the chosen reward model is not revealed to the agent. Hence, the latent state space, for which the dynamics are Markovian, is not given to the agent. We study the problem of learning a near optimal policy for two reward-mixing MDPs. Unlike existing approaches that rely on strong assumptions on the dynamics, we make no assumptions and study the problem in full generality. Indeed, with no further assumptions, even for two switching reward-models, the problem requires several new ideas beyond existing algorithmic and analysis techniques for efficient exploration. We provide the first polynomial-time algorithm that finds an $\epsilon$-optimal policy after exploring $\tilde{O}(\mathrm{poly}(H, \epsilon^{-1}) \cdot S^2 A^2)$ episodes, where $H$ is the time horizon and $S, A$ are the number of states and actions, respectively. This is the first efficient algorithm that does not require any assumptions in partially observed environments where the observation space is smaller than the latent state space.

AAAI Conference 2021 Conference Paper

Reinforcement Learning with Trajectory Feedback

  • Yonathan Efroni
  • Nadav Merlis
  • Shie Mannor

The standard feedback model of reinforcement learning requires revealing the reward of every visited state-action pair. However, in practice, it is often the case that such frequent feedback is not available. In this work, we take a first step towards relaxing this assumption and require a weaker form of feedback, which we refer to as trajectory feedback. Instead of observing the reward obtained after every action, we assume we only receive a score that represents the quality of the whole trajectory observed by the agent, namely, the sum of all rewards obtained over this trajectory. We extend reinforcement learning algorithms to this setting, based on least-squares estimation of the unknown reward, for both the known and unknown transition model cases, and study the performance of these algorithms by analyzing their regret. For cases where the transition model is unknown, we offer a hybrid optimistic-Thompson Sampling approach that results in a tractable algorithm.
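
The least-squares idea is easy to sketch in a toy tabular setting (the sizes are hypothetical; this is not the paper's algorithm for the unknown-transition case):

```python
# Each trajectory reveals only the *sum* of its per-state rewards, so the
# unknown reward vector can be recovered by regressing returns on visit counts.
import numpy as np

rng = np.random.default_rng(1)
true_r = np.array([1.0, -0.5, 0.2, 0.0])       # unknown per-state rewards

# Each row counts how often a trajectory visited each of the 4 states.
visits = rng.integers(0, 5, size=(50, 4)).astype(float)
returns = visits @ true_r + 0.01 * rng.standard_normal(50)  # noisy sums

r_hat, *_ = np.linalg.lstsq(visits, returns, rcond=None)
print(np.round(r_hat, 2))   # close to [1.0, -0.5, 0.2, 0.0]
```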

NeurIPS Conference 2021 Conference Paper

RL for Latent MDPs: Regret Guarantees and a Lower Bound

  • Jeongyeol Kwon
  • Yonathan Efroni
  • Constantine Caramanis
  • Shie Mannor

In this work, we consider the regret minimization problem for reinforcement learning in latent Markov Decision Processes (LMDP). In an LMDP, an MDP is randomly drawn from a set of $M$ possible MDPs at the beginning of the interaction, but the identity of the chosen MDP is not revealed to the agent. We first show that a general instance of LMDPs requires at least $\Omega((SA)^M)$ episodes to even approximate the optimal policy. Then, we consider sufficient assumptions under which learning good policies requires polynomial number of episodes. We show that the key link is a notion of separation between the MDP system dynamics. With sufficient separation, we provide an efficient algorithm with local guarantee, {\it i. e. ,} providing a sublinear regret guarantee when we are given a good initialization. Finally, if we are given standard statistical sufficiency assumptions common in the Predictive State Representation (PSR) literature (e. g. , \cite{boots2011online}) and a reachability assumption, we show that the need for initialization can be removed.

NeurIPS Conference 2021 Conference Paper

Sim and Real: Better Together

  • Shirli Di-Castro
  • Dotan Di Castro
  • Shie Mannor

Simulation is used extensively in autonomous systems, particularly in robotic manipulation. By far, the most common approach is to train a controller in simulation, and then use it as an initial starting point for the real system. We demonstrate how to learn simultaneously from both simulation and interaction with the real environment. We propose an algorithm for balancing the large number of samples from the high throughput but less accurate simulation and the low-throughput, high-fidelity and costly samples from the real environment. We achieve that by maintaining a replay buffer for each environment the agent interacts with. We analyze such multi-environment interaction theoretically, and provide convergence properties, through a novel theoretical replay buffer analysis. We demonstrate the efficacy of our method on a sim-to-real environment.
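
A minimal sketch of the per-environment replay idea under an assumed fixed mixing ratio (the class and parameter names are ours, not the authors'):

```python
# One buffer per environment; minibatches mix a fixed fraction of scarce
# real samples with cheap simulation samples.
import random

class MixedReplay:
    def __init__(self, real_fraction: float = 0.25):
        self.buffers = {"sim": [], "real": []}
        self.real_fraction = real_fraction

    def add(self, env: str, transition) -> None:
        self.buffers[env].append(transition)

    def sample(self, batch_size: int):
        n_real = min(int(batch_size * self.real_fraction),
                     len(self.buffers["real"]))
        batch = random.sample(self.buffers["real"], n_real)
        batch += random.choices(self.buffers["sim"], k=batch_size - n_real)
        return batch

replay = MixedReplay()
for t in range(1000):
    replay.add("sim", ("s", "a", 0.0, "s'"))   # high-throughput simulator
for t in range(20):
    replay.add("real", ("s", "a", 1.0, "s'"))  # costly real interactions
print(len(replay.sample(64)))                  # 64 mixed transitions
```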

NeurIPS Conference 2021 Conference Paper

Twice regularized MDPs and the equivalence between robustness and regularization

  • Esther Derman
  • Matthieu Geist
  • Shie Mannor

Robust Markov decision processes (MDPs) aim to handle changing or partially known system dynamics. To solve them, one typically resorts to robust optimization methods. However, this significantly increases computational complexity and limits scalability in both learning and planning. On the other hand, regularized MDPs show more stability in policy learning without impairing time complexity. Yet, they generally do not encompass uncertainty in the model dynamics. In this work, we aim to learn robust MDPs using regularization. We first show that regularized MDPs are a particular instance of robust MDPs with uncertain reward. We thus establish that policy iteration on reward-robust MDPs can have the same time complexity as on regularized MDPs. We further extend this relationship to MDPs with uncertain transitions: this leads to a regularization term with an additional dependence on the value function. We finally generalize regularized MDPs to twice regularized MDPs (R${}^2$ MDPs), i.e., MDPs with $\textit{both}$ value and policy regularization. The corresponding Bellman operators enable developing policy iteration schemes with convergence and robustness guarantees. This also reduces planning and learning in robust MDPs to regularized MDPs.

ICML Conference 2021 Conference Paper

Value Iteration in Continuous Actions, States and Time

  • Michael Lutter
  • Shie Mannor
  • Jan Peters 0001
  • Dieter Fox
  • Animesh Garg

Classical value iteration approaches are not applicable to environments with continuous states and actions. For such environments the states and actions must be discretized, which leads to an exponential increase in computational complexity. In this paper, we propose continuous fitted value iteration (cFVI). This algorithm enables dynamic programming for continuous states and actions with a known dynamics model. Exploiting the continuous time formulation, the optimal policy can be derived for non-linear control-affine dynamics. This closed-form solution enables the efficient extension of value iteration to continuous environments. We show in non-linear control experiments that the dynamic programming solution obtains the same quantitative performance as deep reinforcement learning methods in simulation but excels when transferred to the physical system. The policy obtained by cFVI is more robust to changes in the dynamics despite using only a deterministic model and without explicitly incorporating robustness in the optimization.
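
The closed-form greedy action alluded to above follows a standard Hamilton-Jacobi-Bellman argument; here is a sketch for the quadratic action-cost special case (our notation and sign conventions, which may differ from the paper's):

```latex
% Control-affine dynamics with a quadratic action penalty (special case):
%   \dot{x} = f(x) + B(x)\,u, \qquad r(x, u) = q(x) - \tfrac{1}{2} u^\top R u.
% Maximizing the Hamiltonian q(x) - (1/2) u^T R u + \nabla V(x)^T (f + B u)
% over u yields the greedy action in closed form:
u^*(x) = R^{-1} B(x)^\top \nabla_x V(x)
```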

AAAI Conference 2020 Conference Paper

Adaptive Trust Region Policy Optimization: Global Convergence and Faster Rates for Regularized MDPs

  • Lior Shani
  • Yonathan Efroni
  • Shie Mannor

Trust region policy optimization (TRPO) is a popular and empirically successful policy search algorithm in Reinforcement Learning (RL) in which a surrogate problem, that restricts consecutive policies to be ‘close’ to one another, is iteratively solved. Nevertheless, TRPO has been considered a heuristic algorithm inspired by Conservative Policy Iteration (CPI). We show that the adaptive scaling mechanism used in TRPO is in fact the natural “RL version” of traditional trust-region methods from convex analysis. We first analyze TRPO in the planning setting, in which we have access to the model and the entire state space. Then, we consider sample-based TRPO and establish a $\tilde{O}(1/\sqrt{N})$ convergence rate to the global optimum. Importantly, the adaptive scaling mechanism allows us to analyze TRPO in regularized MDPs, for which we prove fast rates of $\tilde{O}(1/N)$, much like results in convex optimization. This is the first result in RL showing better rates when regularizing the instantaneous cost or reward.

AAAI Conference 2020 Conference Paper

Off-Policy Evaluation in Partially Observable Environments

  • Guy Tennenholtz
  • Uri Shalit
  • Shie Mannor

This work studies the problem of batch off-policy evaluation for Reinforcement Learning in partially observable environments. Off-policy evaluation under partial observability is inherently prone to bias, with risk of arbitrarily large errors. We define the problem of off-policy evaluation for Partially Observable Markov Decision Processes (POMDPs) and establish what we believe is the first off-policy evaluation result for POMDPs. In addition, we formulate a model in which observed and unobserved variables are decoupled into two dynamic processes, called a Decoupled POMDP. We show how off-policy evaluation can be performed under this new model, mitigating estimation errors inherent to general POMDPs. We demonstrate the pitfalls of off-policy evaluation in POMDPs using a well-known off-policy method, Importance Sampling, and compare it with our result on synthetic medical data.

NeurIPS Conference 2020 Conference Paper

Online Planning with Lookahead Policies

  • Yonathan Efroni
  • Mohammad Ghavamzadeh
  • Shie Mannor

Real Time Dynamic Programming (RTDP) is an online algorithm based on Dynamic Programming (DP) that acts by 1-step greedy planning. Unlike DP, RTDP does not require access to the entire state space, i.e., it explicitly handles the exploration. This fact makes RTDP particularly appealing when the state space is large and it is not possible to update all states simultaneously. In this work, we devise a multi-step greedy RTDP algorithm, which we call $h$-RTDP, that replaces the 1-step greedy policy with an $h$-step lookahead policy. We analyze $h$-RTDP in its exact form and establish that increasing the lookahead horizon, $h$, results in an improved sample complexity, at the cost of additional computations. This is the first work that proves improved sample complexity as a result of {\em increasing} the lookahead horizon in online planning. We then analyze the performance of $h$-RTDP in three approximate settings: approximate model, approximate value updates, and approximate state representation. For these cases, we prove that the asymptotic performance of $h$-RTDP remains the same as that of a corresponding approximate DP algorithm, the best one can hope for without further assumptions on the approximation errors.

ICML Conference 2020 Conference Paper

Optimistic Policy Optimization with Bandit Feedback

  • Lior Shani
  • Yonathan Efroni
  • Aviv Rosenberg 0002
  • Shie Mannor

Policy optimization methods are one of the most widely used classes of Reinforcement Learning (RL) algorithms. Yet, so far, such methods have been mostly analyzed from an optimization perspective, without addressing the problem of exploration, or by making strong assumptions on the interaction with the environment. In this paper we consider model-based RL in the tabular finite-horizon MDP setting with unknown transitions and bandit feedback. For this setting, we propose an optimistic trust region policy optimization (TRPO) algorithm for which we establish $\tilde O(\sqrt{S^2 A H^4 K})$ regret for stochastic rewards. Furthermore, we prove $\tilde O(\sqrt{S^2 A H^4} K^{2/3})$ regret for adversarial rewards. Interestingly, this result matches previous bounds derived for the bandit feedback case, yet with known transitions. To the best of our knowledge, the two results are the first sub-linear regret bounds obtained for policy optimization algorithms with unknown transitions and bandit feedback.

ICML Conference 2020 Conference Paper

Topic Modeling via Full Dependence Mixtures

  • Dan Fisher
  • Mark Kozdoba
  • Shie Mannor

In this paper we introduce a new approach to topic modeling that scales to large datasets by using a compact representation of the data and by leveraging the GPU architecture. In this approach, topics are learned directly from the co-occurrence data of the corpus. In particular, we introduce a novel mixture model which we term the Full Dependence Mixture (FDM) model. FDMs model the second moment under general generative assumptions on the data. While there is previous work on topic modeling using second moments, we develop a direct stochastic optimization procedure for fitting an FDM with a single Kullback-Leibler objective. Moment methods in general have the benefit that an iteration no longer needs to scale with the size of the corpus. Our approach allows us to leverage standard optimizers and GPUs for the problem of topic modeling. In particular, we evaluate the approach on two large datasets, NeurIPS papers and a Twitter corpus, with a large number of topics, and show that the approach performs comparably or better than the standard benchmarks.

UAI Conference 2019 Conference Paper

A Bayesian Approach to Robust Reinforcement Learning

  • Esther Derman
  • Daniel J. Mankowitz
  • Timothy A. Mann
  • Shie Mannor

Robust Markov Decision Processes (RMDPs) intend to ensure robustness with respect to changing or adversarial system behavior. In this framework, transitions are modeled as arbitrary elements of a known and properly structured uncertainty set and a robust optimal policy can be derived under the worst-case scenario. In this study, we address the issue of learning in RMDPs using a Bayesian approach. We introduce the Uncertainty Robust Bellman Equation (URBE) which encourages safe exploration for adapting the uncertainty set to new observations while preserving robustness. We propose a URBE-based algorithm, DQN-URBE, that scales this method to higher dimensional domains. Our experiments show that the derived URBE-based strategy leads to a better trade-off between less conservative solutions and robustness in the presence of model misspecification. In addition, we show that the DQN-URBE algorithm can adapt significantly faster to changing dynamics online compared to existing robust techniques with fixed uncertainty sets.

ICML Conference 2019 Conference Paper

Action Robust Reinforcement Learning and Applications in Continuous Control

  • Chen Tessler
  • Yonathan Efroni
  • Shie Mannor

A policy is said to be robust if it maximizes the reward while considering a bad, or even adversarial, model. In this work we formalize two new criteria of robustness to action uncertainty. Specifically, we consider two scenarios in which the agent attempts to perform an action $a$, and (i) with probability $\alpha$, an alternative adversarial action $\bar{a}$ is taken, or (ii) an adversary adds a perturbation to the selected action in the case of continuous action space. We show that our criteria are related to common forms of uncertainty in robotics domains, such as the occurrence of abrupt forces, and suggest algorithms in the tabular case. Building on the suggested algorithms, we generalize our approach to deep reinforcement learning (DRL) and provide extensive experiments in the various MuJoCo domains. Our experiments show that not only does our approach produce robust policies, but it also improves the performance in the absence of perturbations. This generalization indicates that action-robustness can be thought of as implicit regularization in RL problems.
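
Criterion (i) is simple to emulate as an environment wrapper (a toy sketch with a hypothetical interface, not the paper's code):

```python
# With probability alpha, the executed action is swapped for an adversarially
# chosen alternative before it reaches the environment.
import random

class ActionPerturbedEnv:
    def __init__(self, env, adversary, alpha: float):
        self.env, self.adversary, self.alpha = env, adversary, alpha

    def step(self, action):
        if random.random() < self.alpha:
            action = self.adversary(action)  # adversarial replacement
        return self.env.step(action)

# Example: an adversary that always plays action 0 against a stub env.
class StubEnv:
    def step(self, action):
        return action  # echo the action actually executed

env = ActionPerturbedEnv(StubEnv(), adversary=lambda a: 0, alpha=0.3)
print([env.step(1) for _ in range(10)])  # mostly 1s, occasionally 0s
```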

RLDM Conference 2019 Conference Abstract

Action Robust Reinforcement Learning and Applications in Continuous Control

  • Chen Tessler
  • Yonathan Efroni
  • Shie Mannor

A policy is said to be robust if it maximizes the reward while considering a bad, or even adversarial, model. In this work we formalize a new criterion of robustness to action uncertainty. Specifically, we consider a scenario in which the agent attempts to perform an action a, and with probability α, an alternative adversarial action ā is taken. We show that our criterion is related to common forms of uncertainty in robotics domains, such as the occurrence of abrupt forces, and suggest algorithms in the tabular case. Building on the suggested algorithms, we generalize our approach to deep reinforcement learning (DRL) and provide extensive experiments in the various MuJoCo domains. Our experiments show that not only does our approach produce robust policies, but it also improves the performance in the absence of perturbations. This generalization indicates that action-robustness can be thought of as implicit regularization in RL problems.

RLDM Conference 2019 Conference Abstract

Distributed Q-learning with Gittins Prioritization

  • Jhonathan Osin
  • Naama Pearl
  • Tom Zahavy
  • Shie Mannor

We consider a distributed reinforcement learning framework where multiple agents interact with the environment in parallel, while sharing experience, in order to find the optimal policy. At each time step, only a subset of the agents is selected to interact with the environment. We explore several mechanisms for selecting which agents to prioritize based on the reward and the TD-error, and analyze their effect on the learning process. When the model is known, the optimal prioritization policy is the Gittins index. We propose an algorithm for learning the Gittins index from demonstrations and show that it yields an $\epsilon$-optimal Gittins policy. Simulations in tabular MDPs show that prioritization significantly improves the sample complexity.

NeurIPS Conference 2019 Conference Paper

Distributional Policy Optimization: An Alternative Approach for Continuous Control

  • Chen Tessler
  • Guy Tennenholtz
  • Shie Mannor

We identify a fundamental problem in policy gradient-based methods in continuous control. As policy gradient methods require the agent's underlying probability distribution, they limit policy representation to parametric distribution classes. We show that optimizing over such sets results in local movement in the action space and thus convergence to sub-optimal solutions. We suggest a novel distributional framework, able to represent arbitrary distribution functions over the continuous action space. Using this framework, we construct a generative scheme, trained using an off-policy actor-critic paradigm, which we call the Generative Actor Critic (GAC). Compared to policy gradient methods, GAC does not require knowledge of the underlying probability distribution, thereby overcoming these limitations. Empirical evaluation shows that our approach is comparable and often surpasses current state-of-the-art baselines in continuous domains.

ICML Conference 2019 Conference Paper

Exploration Conscious Reinforcement Learning Revisited

  • Lior Shani
  • Yonathan Efroni
  • Shie Mannor

The Exploration-Exploitation tradeoff arises in Reinforcement Learning when one cannot tell if a policy is optimal. Then, there is a constant need to explore new actions instead of exploiting past experience. In practice, it is common to resolve the tradeoff by using a fixed exploration mechanism, such as $\epsilon$-greedy exploration or by adding Gaussian noise, while still trying to learn an optimal policy. In this work, we take a different approach and study exploration-conscious criteria, that result in optimal policies with respect to the exploration mechanism. Solving these criteria, as we establish, amounts to solving a surrogate Markov Decision Process. We then analyze properties of exploration-conscious optimal policies and characterize two general approaches to solve such criteria. Building on the approaches, we apply simple changes in existing tabular and deep Reinforcement Learning algorithms and empirically demonstrate superior performance relative to their non-exploration-conscious counterparts, both for discrete and continuous action spaces.

AAAI Conference 2019 Conference Paper

How to Combine Tree-Search Methods in Reinforcement Learning

  • Yonathan Efroni
  • Gal Dalal
  • Bruno Scherrer
  • Shie Mannor

Finite-horizon lookahead policies are abundantly used in Reinforcement Learning and demonstrate impressive empirical success. Usually, the lookahead policies are implemented with specific planning methods such as Monte Carlo Tree Search (e.g., in AlphaZero (Silver et al., 2017b)). Referring to the planning problem as tree search, a reasonable practice in these implementations is to back up the value only at the leaves while the information obtained at the root is not leveraged other than for updating the policy. Here, we question the potency of this approach. Namely, the latter procedure is non-contractive in general, and its convergence is not guaranteed. Our proposed enhancement is straightforward and simple: use the return from the optimal tree path to back up the values at the descendants of the root. This leads to a $\gamma^h$-contracting procedure, where $\gamma$ is the discount factor and $h$ is the tree depth. To establish our results, we first introduce a notion called multiple-step greedy consistency. We then provide convergence rates for two algorithmic instantiations of the above enhancement in the presence of noise injected into both the tree search stage and value estimation stage.

RLDM Conference 2019 Conference Abstract

Inverse Reinforcement Learning in Contextual MDPs

  • Philip Korsunsky
  • Stav Belogolovsky
  • Tom Zahavy
  • Chen Tessler
  • Shie Mannor

We consider the Inverse Reinforcement Learning (IRL) problem in Contextual Markov Decision Processes (CMDPs). Here, the reward of the environment depends on a hidden static parameter referred to as the context, i.e., each context defines an MDP. The agent does not observe the reward, but instead, it is provided with expert demonstrations for each context. The goal of the agent is to learn a mapping from contexts to rewards that will guarantee performance which is similar to that of the expert on unseen contexts. We suggest two methods for learning in this scenario. (1) For rewards that are a linear function of the context, we provide a method that is guaranteed to return an $\epsilon$-optimal solution after a polynomial number of demonstrations. (2) For general reward functions, we propose a black-box optimization method. We test our methods in an autonomous driving simulation and demonstrate their ability to learn and generalize to unseen contexts.

ICML Conference 2019 Conference Paper

Nonlinear Distributional Gradient Temporal-Difference Learning

  • Chao Qu
  • Shie Mannor
  • Huan Xu 0001

We devise a distributional variant of gradient temporal-difference (TD) learning. Distributional reinforcement learning has been demonstrated to outperform the regular one in a recent study \citep{bellemare2017distributional}. In the policy evaluation setting, we design two new algorithms called distributional GTD2 and distributional TDC using the Cramér distance on the distributional version of the Bellman error objective function, which inherits advantages of both the nonlinear gradient TD algorithms and the distributional RL approach. In the control setting, we propose the distributional Greedy-GQ using a similar derivation. We prove the asymptotic almost-sure convergence of distributional GTD2 and TDC to a local optimal solution for general smooth function approximators, which includes neural networks that have been widely used in recent studies to solve real-life RL problems. In each step, the computational complexity of the above three algorithms is linear w.r.t. the number of parameters of the function approximator, and thus they can be implemented efficiently for neural networks.

AAAI Conference 2019 Conference Paper

On-Line Learning of Linear Dynamical Systems: Exponential Forgetting in Kalman Filters

  • Mark Kozdoba
  • Jakub Marecek
  • Tigran Tchrakian
  • Shie Mannor

The Kalman filter is a key tool for time-series forecasting and analysis. We show that the dependence of the Kalman filter's predictions on the past decays exponentially, whenever the process noise is non-degenerate. Therefore, the Kalman filter may be approximated by regression on a few recent observations. Surprisingly, we also show that having some process noise is essential for the exponential decay. With no process noise, it may happen that the forecast depends on all of the past uniformly, which makes forecasting more difficult. Based on this insight, we devise an on-line algorithm for improper learning of a linear dynamical system (LDS), which considers only a few most recent observations. We use our decay results to provide the first regret bounds w.r.t. Kalman filters for learning an LDS. That is, we compare the results of our algorithm to the best, in hindsight, Kalman filter for a given signal. Also, the algorithm is practical: its per-update run-time is linear in the regression depth.
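
The practical consequence, regressing on just a few recent observations, can be illustrated on a toy AR(1) system (the parameters below are ours, not the paper's):

```python
# Because the filter's dependence on the past decays exponentially, a
# regression on the last few observations approximates its forecast well.
import numpy as np

rng = np.random.default_rng(2)
n, phi = 2000, 0.8
x = np.zeros(n)
for t in range(1, n):                    # latent AR(1) with process noise
    x[t] = phi * x[t - 1] + 0.5 * rng.standard_normal()
y = x + 0.5 * rng.standard_normal(n)     # noisy observations

depth = 5                                # regress on the last 5 observations
X = np.column_stack([y[depth - k - 1 : n - k - 1] for k in range(depth)])
target = y[depth:]
coef, *_ = np.linalg.lstsq(X, target, rcond=None)
pred = X @ coef
rmse = np.sqrt(np.mean((pred - target) ** 2))
print(f"RMSE of depth-{depth} regression forecast: {rmse:.3f}")
```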

RLDM Conference 2019 Conference Abstract

Sparse Imitation Learning for Text Based Games with Combinatorial Action Spaces

  • Chen Tessler
  • Tom Zahavy
  • Deborah Cohen
  • Shie Mannor

We propose a computationally efficient algorithm that combines compressed sensing with imitation learning to solve sequential decision-making in text-based games with combinatorial action spaces. To do so, we derive a variation of the compressed sensing algorithm Orthogonal Matching Pursuit (OMP), that we call IK-OMP, and show that it can recover a bag-of-words from a sum of the individual word embeddings, even in the presence of noise. We incorporate IK-OMP into a supervised imitation learning setting and show that this algorithm, called Sparse Imitation Learning (Sparse-IL), solves the entire text-based game of Zork1 with an action space of approximately 10 million actions using imperfect, noisy demonstrations.
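
The recovery step can be illustrated with a hand-rolled OMP on random embeddings (a generic OMP sketch with hypothetical dimensions, not the IK-OMP variant itself):

```python
# Recover which words make up a bag from the *sum* of their embeddings,
# via greedy orthogonal matching pursuit.
import numpy as np

rng = np.random.default_rng(3)
vocab, dim, k = 200, 64, 5
E = rng.standard_normal((dim, vocab))            # word-embedding matrix
true_words = rng.choice(vocab, size=k, replace=False)
y = E[:, true_words].sum(axis=1) + 0.01 * rng.standard_normal(dim)

residual, chosen = y.copy(), []
for _ in range(k):                               # greedy OMP iterations
    scores = np.abs(E.T @ residual)
    scores[chosen] = -np.inf                     # never pick a word twice
    chosen.append(int(np.argmax(scores)))
    coef, *_ = np.linalg.lstsq(E[:, chosen], y, rcond=None)
    residual = y - E[:, chosen] @ coef

print(sorted(chosen) == sorted(true_words.tolist()))  # True with high prob.
```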

RLDM Conference 2019 Conference Abstract

The Natural Language of Actions

  • Guy Tennenholtz
  • Shie Mannor

We introduce Act2Vec, a general framework for learning context-based action representation for Reinforcement Learning. Representing actions in a vector space helps reinforcement learning algorithms achieve better performance by grouping similar actions and utilizing relations between different actions. We show how prior knowledge of an environment can be extracted from demonstrations and injected into action vector representations that encode natural compatible behavior. We then use these for augmenting state representations as well as improving function approximation of Q-values. We visualize and test action embeddings on a high dimensional navigation task and the large action space domain of StarCraft II.

ICML Conference 2019 Conference Paper

The Natural Language of Actions

  • Guy Tennenholtz
  • Shie Mannor

We introduce Act2Vec, a general framework for learning context-based action representation for Reinforcement Learning. Representing actions in a vector space helps reinforcement learning algorithms achieve better performance by grouping similar actions and utilizing relations between different actions. We show how prior knowledge of an environment can be extracted from demonstrations and injected into action vector representations that encode natural compatible behavior. We then use these for augmenting state representations as well as improving function approximation of Q-values. We visualize and test action embeddings in three domains including a drawing task, a high dimensional navigation task, and the large action space domain of StarCraft II.

NeurIPS Conference 2019 Conference Paper

Tight Regret Bounds for Model-Based Reinforcement Learning with Greedy Policies

  • Yonathan Efroni
  • Nadav Merlis
  • Mohammad Ghavamzadeh
  • Shie Mannor

State-of-the-art efficient model-based Reinforcement Learning (RL) algorithms typically act by iteratively solving empirical models, i.e., by performing full-planning on Markov Decision Processes (MDPs) built by the gathered experience. In this paper, we focus on model-based RL in the finite-state finite-horizon MDP setting and establish that exploring with greedy policies -- acting by 1-step planning -- can achieve tight minimax performance in terms of regret, $O(\sqrt{HSAT})$. Thus, full-planning in model-based RL can be avoided altogether without any performance degradation, and, by doing so, the computational complexity decreases by a factor of $S$. The results are based on a novel analysis of real-time dynamic programming, then extended to model-based RL. Specifically, we generalize existing algorithms that perform full-planning to ones that act by 1-step planning. For these generalizations, we prove regret bounds with the same rate as their full-planning counterparts.

NeurIPS Conference 2019 Conference Paper

Value Propagation for Decentralized Networked Deep Multi-agent Reinforcement Learning

  • Chao Qu
  • Shie Mannor
  • Huan Xu
  • Yuan Qi
  • Le Song
  • Junwu Xiong

We consider the networked multi-agent reinforcement learning (MARL) problem in a fully decentralized setting, where agents learn to coordinate to achieve joint success. This problem is widely encountered in many areas including traffic control, distributed control, and smart grids. We assume each agent is located at a node of a communication network and can exchange information only with its neighbors. Using softmax temporal consistency, we derive a primal-dual decentralized optimization method and obtain a principled and data-efficient iterative algorithm named {\em value propagation}. We prove a non-asymptotic convergence rate of $\mathcal{O}(1/T)$ with nonlinear function approximation. To the best of our knowledge, it is the first MARL algorithm with a convergence guarantee in the control, off-policy, non-linear function approximation, fully decentralized setting.

ICML Conference 2018 Conference Paper

Beyond the One-Step Greedy Approach in Reinforcement Learning

  • Yonathan Efroni
  • Gal Dalal
  • Bruno Scherrer
  • Shie Mannor

The famous Policy Iteration algorithm alternates between policy improvement and policy evaluation. Implementations of this algorithm with several variants of the latter evaluation stage, e.g., $n$-step and trace-based returns, have been analyzed in previous works. However, the case of multiple-step lookahead policy improvement, despite the recent increase in empirical evidence of its strength, has to our knowledge not been carefully analyzed yet. In this work, we introduce the first such analysis. Namely, we formulate variants of multiple-step policy improvement, derive new algorithms using these definitions and prove their convergence. Moreover, we show that recent prominent Reinforcement Learning algorithms are, in fact, instances of our framework. We thus shed light on their empirical success and give a recipe for deriving new algorithms for future study.

EWRL Workshop 2018 Workshop Paper

Convergence of Online and Approximate Multiple-Step Lookahead Policy Iteration

  • Yonathan Efroni
  • Gal Dalal
  • Bruno Scherrer
  • Shie Mannor

Multiple-step lookahead policies have demonstrated high empirical competence in Reinforcement Learning, via the use of Monte Carlo Tree Search or Model Predictive Control. In a recent work (Efroni et al., 2018), multiple-step greedy policies and their use in vanilla Policy Iteration algorithms were proposed and analyzed. In this work, we study multiple-step greedy algorithms in more practical setups. We begin by highlighting a counterintuitive difficulty, arising with soft-policy updates: even in the absence of approximations, and contrary to the 1-step-greedy case, monotonic policy improvement is not guaranteed unless the update stepsize is sufficiently large. Taking particular care about this difficulty, we formulate and analyze online and approximate algorithms that use such a multi-step greedy operator.

AAAI Conference 2018 Conference Paper

Finite Sample Analyses for TD(0) With Function Approximation

  • Gal Dalal
  • Balázs Szörényi
  • Gugan Thoppe
  • Shie Mannor

TD(0) is one of the most commonly used algorithms in reinforcement learning. Despite this, there is no existing finite sample analysis for TD(0) with function approximation, even for the linear case. Our work is the first to provide such results. Works that managed to obtain convergence rates for online Temporal Difference (TD) methods analyzed somewhat modified versions of them that include projections and stepsizes dependent on unknown problem parameters. Our analysis obviates these artificial alterations by exploiting strong properties of TD(0). We provide convergence rates both in expectation and with high-probability. Both are based on relatively unknown, recently developed stochastic approximation techniques.
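
For reference, the algorithm under analysis is plain TD(0) with linear function approximation; below is a minimal instance with tabular features, a special linear case (the stepsize and chain are illustrative):

```python
# TD(0) with linear function approximation on a toy two-state chain.
import numpy as np

rng = np.random.default_rng(4)
P = np.array([[0.9, 0.1],      # fixed-policy transition matrix
              [0.1, 0.9]])
r = np.array([0.0, 1.0])       # expected reward per state
phi = np.eye(2)                # tabular features: a special linear case
gamma, alpha, theta, s = 0.9, 0.05, np.zeros(2), 0

for t in range(20000):
    s_next = rng.choice(2, p=P[s])
    td_error = r[s] + gamma * phi[s_next] @ theta - phi[s] @ theta
    theta += alpha * td_error * phi[s]          # TD(0) update
    s = s_next

print(np.round(theta, 2))  # approximates the true value function
```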

AAAI Conference 2018 Conference Paper

Is a Picture Worth a Thousand Words? A Deep Multi-Modal Architecture for Product Classification in E-Commerce

  • Tom Zahavy
  • Abhinandan Krishnan
  • Alessandro Magnani
  • Shie Mannor

Classifying products precisely and efficiently is a major challenge in modern e-commerce. The high traffic of new products uploaded daily and the dynamic nature of the categories raise the need for machine learning models that can reduce the cost and time of human editors. In this paper, we propose a decision level fusion approach for multi-modal product classification based on text and image neural network classifiers. We train input specific state-of-the-art deep neural networks for each input source, show the potential of forging them together into a multi-modal architecture and train a novel policy network that learns to choose between them. Finally, we demonstrate that our multi-modal network improves classification accuracy over both networks on a real-world large-scale product classification dataset that we collected from Walmart.com. While we focus on image-text fusion that characterizes e-commerce businesses, our algorithms can be easily applied to other modalities such as audio, video, physical sensors, etc.

NeurIPS Conference 2018 Conference Paper

Learn What Not to Learn: Action Elimination with Deep Reinforcement Learning

  • Tom Zahavy
  • Matan Haroush
  • Nadav Merlis
  • Daniel Mankowitz
  • Shie Mannor

Learning how to act when there are many available actions in each state is a challenging task for Reinforcement Learning (RL) agents, especially when many of the actions are redundant or irrelevant. In such cases, it is easier to learn which actions not to take. In this work, we propose the Action-Elimination Deep Q-Network (AE-DQN) architecture that combines a Deep RL algorithm with an Action Elimination Network (AEN) that eliminates sub-optimal actions. The AEN is trained to predict invalid actions, supervised by an external elimination signal provided by the environment. Simulations demonstrate a considerable speedup and added robustness over vanilla DQN in text-based games with over a thousand discrete actions.

EWRL Workshop 2018 Workshop Paper

Learn What Not to Learn: Action Elimination with Deep Reinforcement Learning

  • Tom Zahavy
  • Matan Haroush
  • Nadav Merlis
  • Daniel J. Mankowitz
  • Shie Mannor

Learning how to act when there are many available actions in each state is a challenging task for Reinforcement Learning (RL) agents, especially when many of the actions are redundant or irrelevant. In such cases, it is sometimes easier to learn which actions not to take. In this work, we propose the Action-Elimination Deep Q-Network (AE-DQN) architecture that combines a Deep RL algorithm with an Action Elimination Network (AEN) that eliminates sub-optimal actions. The AEN is trained to predict invalid actions, supervised by an external elimination signal provided by the environment. Simulations demonstrate a considerable speedup and added robustness over vanilla DQN in text-based games with over a thousand discrete actions.

AAAI Conference 2018 Conference Paper

Learning Robust Options

  • Daniel Mankowitz
  • Timothy Mann
  • Pierre-Luc Bacon
  • Doina Precup
  • Shie Mannor

Robust reinforcement learning aims to produce policies that have strong guarantees even in the face of environments or transition models whose parameters are highly uncertain. Existing work uses value-based methods and the usual primitive action setting. In this paper, we propose robust methods for learning temporally abstract actions, in the framework of options. We present a Robust Options Policy Iteration (ROPI) algorithm with convergence guarantees, which learns options that are robust to model uncertainty. We utilize ROPI to learn robust options with the Robust Options Deep Q Network (RO-DQN) that solves multiple tasks and mitigates model misspecification due to model uncertainty. We present experimental results which suggest that policy iteration with linear features may have an inherent form of robustness when using coarse feature representations. In addition, we present experimental results which demonstrate that robustness helps policy iteration implemented on top of deep neural networks to generalize over a much broader range of dynamics than non-robust policy iteration.

NeurIPS Conference 2018 Conference Paper

Multiple-Step Greedy Policies in Approximate and Online Reinforcement Learning

  • Yonathan Efroni
  • Gal Dalal
  • Bruno Scherrer
  • Shie Mannor

Multiple-step lookahead policies have demonstrated high empirical competence in Reinforcement Learning, via the use of Monte Carlo Tree Search or Model Predictive Control. In a recent work (Efroni et al., 2018), multiple-step greedy policies and their use in vanilla Policy Iteration algorithms were proposed and analyzed. In this work, we study multiple-step greedy algorithms in more practical setups. We begin by highlighting a counter-intuitive difficulty, arising with soft-policy updates: even in the absence of approximations, and contrary to the 1-step-greedy case, monotonic policy improvement is not guaranteed unless the update stepsize is sufficiently large. Taking particular care about this difficulty, we formulate and analyze online and approximate algorithms that use such a multi-step greedy operator.
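
For intuition about the operator under study, here is a hedged sketch of h-step greedy action selection on a small, known tabular MDP. It is a brute-force illustration under assumed dense-array inputs, not the paper's online or approximate algorithms.

    import numpy as np

    def h_step_greedy_action(P, R, V, s, gamma=0.99, h=2):
        """Exhaustive h-step lookahead on a known tabular MDP, bootstrapping
        with the current value estimate V at the horizon. Cost is exponential
        in h; this only illustrates the multi-step greedy operator.
        P: (S, A, S) transition tensor, R: (S, A) rewards."""
        n_states, n_actions = R.shape

        def lookahead(state, depth):
            if depth == 0:
                return V[state], None
            best_val, best_act = -np.inf, None
            for a in range(n_actions):
                future = sum(P[state, a, s2] * lookahead(s2, depth - 1)[0]
                             for s2 in range(n_states))
                val = R[state, a] + gamma * future
                if val > best_val:
                    best_val, best_act = val, a
            return best_val, best_act

        return lookahead(s, h)[1]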

UAI Conference 2018 Conference Paper

Soft-Robust Actor-Critic Policy-Gradient

  • Esther Derman
  • Daniel J. Mankowitz
  • Timothy A. Mann
  • Shie Mannor

Robust Reinforcement Learning aims to derive an optimal behavior that accounts for model uncertainty in dynamical systems. However, previous studies have shown that by considering the worst-case scenario, robust policies can be overly conservative. Our soft-robust framework is an attempt to overcome this issue. In this paper, we present a novel Soft-Robust Actor-Critic algorithm (SR-AC). It learns an optimal policy with respect to a distribution over an uncertainty set and stays robust to model uncertainty but avoids the conservativeness of robust strategies. We show the convergence of SR-AC and test the efficiency of our approach on different domains by comparing it against regular learning methods and their robust formulations.

AAAI Conference 2017 Conference Paper

A Deep Hierarchical Approach to Lifelong Learning in Minecraft

  • Chen Tessler
  • Shahar Givony
  • Tom Zahavy
  • Daniel Mankowitz
  • Shie Mannor

We propose a lifelong learning system that has the ability to reuse and transfer knowledge from one task to another while efficiently retaining the previously learned knowledge base. Knowledge is transferred by learning reusable skills to solve tasks in Minecraft, a popular video game which is an unsolved and high-dimensional lifelong learning problem. These reusable skills, which we refer to as Deep Skill Networks, are then incorporated into our novel Hierarchical Deep Reinforcement Learning Network (H-DRLN) architecture using two techniques: (1) a deep skill array and (2) skill distillation, our novel variation of policy distillation (Rusu et al. 2015) for learning skills. Skill distillation enables the H-DRLN to efficiently retain knowledge and therefore scale in lifelong learning, by accumulating knowledge and encapsulating multiple reusable skills into a single distilled network. The H-DRLN exhibits superior performance and lower learning sample complexity compared to the regular Deep Q Network (Mnih et al. 2015) in sub-domains of Minecraft.

IJCAI Conference 2017 Conference Paper

Approximate Value Iteration with Temporally Extended Actions (Extended Abstract)

  • Timothy A. Mann
  • Shie Mannor
  • Doina Precup

The options framework provides a concrete way to implement and reason about temporally extended actions. Existing literature has demonstrated the value of planning with options empirically, but there is a lack of theoretical analysis formalizing when planning with options is more efficient than planning with primitive actions. We provide a general analysis of the convergence rate of a popular Approximate Value Iteration (AVI) algorithm called Fitted Value Iteration (FVI) with options. Our analysis reveals that longer duration options and a pessimistic estimate of the value function both lead to faster convergence. Furthermore, options can improve convergence even when they are suboptimal and sparsely distributed throughout the state space. Next we consider generating useful options for planning based on a subset of landmark states. This suggests a new algorithm, Landmark-based AVI (LAVI), that represents the value function only at landmark states. We analyze OFVI (FVI with options) and LAVI using the proposed landmark-based options and compare the two algorithms. Our theoretical and experimental results demonstrate that options can play an important role in AVI by decreasing approximation error and inducing fast convergence.

ICML Conference 2017 Conference Paper

Consistent On-Line Off-Policy Evaluation

  • Assaf Hallak
  • Shie Mannor

The problem of on-line off-policy evaluation (OPE) has been actively studied in the last decade due to its importance both as a stand-alone problem and as a module in a policy improvement scheme. However, most Temporal Difference (TD) based solutions ignore the discrepancy between the stationary distribution of the behavior and target policies and its effect on the convergence limit when function approximation is applied. In this paper we propose the Consistent Off-Policy Temporal Difference (COP-TD($\lambda$, $\beta$)) algorithm that addresses this issue and reduces this bias at some computational expense. We show that COP-TD($\lambda$, $\beta$) can be designed to converge to the same value that would have been obtained by using on-policy TD($\lambda$) with the target policy. Subsequently, the proposed scheme leads to a related and promising heuristic we call log-COP-TD($\lambda$, $\beta$). Both algorithms show favorable empirical results compared to current state-of-the-art on-line OPE algorithms. Finally, our formulation sheds some new light on the recently proposed Emphatic TD learning.

ICML Conference 2017 Conference Paper

End-to-End Differentiable Adversarial Imitation Learning

  • Nir Baram
  • Oron Anschel
  • Itai Caspi
  • Shie Mannor

Generative Adversarial Networks (GANs) have been successfully applied to the problem of policy imitation in a model-free setup. However, the computation graph of GANs, which includes a stochastic policy as the generative model, is no longer differentiable end-to-end, which requires the use of high-variance gradient estimation. In this paper, we introduce the Model-based Generative Adversarial Imitation Learning (MGAIL) algorithm. We show how to use a forward model to make the computation fully differentiable, which enables training policies using the exact gradient of the discriminator. The resulting algorithm trains competent policies using relatively fewer expert samples and interactions with the environment. We test it on both discrete and continuous action domains and report results that surpass the state-of-the-art.

ICML Conference 2017 Conference Paper

Multi-objective Bandits: Optimizing the Generalized Gini Index

  • Róbert Busa-Fekete
  • Balázs Szörényi
  • Paul Weng
  • Shie Mannor

We study the multi-armed bandit (MAB) problem where the agent receives a vectorial feedback that encodes many possibly competing objectives to be optimized. The goal of the agent is to find a policy, which can optimize these objectives simultaneously in a fair way. This multi-objective online optimization problem is formalized by using the Generalized Gini Index (GGI) aggregation function. We propose an online gradient descent algorithm which exploits the convexity of the GGI aggregation function, and controls the exploration in a careful way, achieving a distribution-free regret $\tilde{O}(T^{-1/2})$ with high probability. We test our algorithm on synthetic data as well as on an electric battery control problem where the goal is to trade off the use of the different cells of a battery in order to balance their respective degradation rates.

NeurIPS Conference 2017 Conference Paper

Rotting Bandits

  • Nir Levine
  • Koby Crammer
  • Shie Mannor

The Multi-Armed Bandits (MAB) framework highlights the trade-off between acquiring new knowledge (Exploration) and leveraging available knowledge (Exploitation). In the classical MAB problem, a decision maker must choose an arm at each time step, upon which she receives a reward. The decision maker's objective is to maximize her cumulative expected reward over the time horizon. The MAB problem has been studied extensively, specifically under the assumption of the arms' rewards distributions being stationary, or quasi-stationary, over time. We consider a variant of the MAB framework, which we termed Rotting Bandits, where each arm's expected reward decays as a function of the number of times it has been pulled. We are motivated by many real-world scenarios such as online advertising, content recommendation, crowdsourcing, and more. We present algorithms, accompanied by simulations, and derive theoretical guarantees.
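
The setting is easy to simulate; the sketch below uses a sliding-window mean as the arm index, since old pulls overestimate a decaying arm. This is a generic illustration with invented names, not a reproduction of the paper's algorithms or guarantees.

    import numpy as np
    from collections import deque

    def sliding_window_rotting(pull, n_arms, horizon, window=20):
        """A sliding-window heuristic for rotting bandits: score each arm by
        the mean of its most recent `window` rewards. `pull(a)` is an assumed
        callable returning a stochastic, decaying reward."""
        recent = [deque(maxlen=window) for _ in range(n_arms)]
        total = 0.0
        for t in range(horizon):
            if t < n_arms:
                a = t                                   # pull each arm once
            else:
                a = int(np.argmax([np.mean(d) for d in recent]))
            reward = pull(a)
            recent[a].append(reward)
            total += reward
        return total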

NeurIPS Conference 2017 Conference Paper

Shallow Updates for Deep Reinforcement Learning

  • Nir Levine
  • Tom Zahavy
  • Daniel Mankowitz
  • Aviv Tamar
  • Shie Mannor

Deep reinforcement learning (DRL) methods such as the Deep Q-Network (DQN) have achieved state-of-the-art results in a variety of challenging, high-dimensional domains. This success is mainly attributed to the power of deep neural networks to learn rich domain representations for approximating the value function or policy. Batch reinforcement learning methods with linear representations, on the other hand, are more stable and require less hyperparameter tuning. Yet, substantial feature engineering is necessary to achieve good results. In this work we propose a hybrid approach -- the Least Squares Deep Q-Network (LS-DQN), which combines rich feature representations learned by a DRL algorithm with the stability of a linear least squares method. We do this by periodically re-training the last hidden layer of a DRL network with a batch least squares update. Key to our approach is a Bayesian regularization term for the least squares update, which prevents over-fitting to the more recent data. We tested LS-DQN on five Atari games and demonstrated significant improvement over vanilla DQN and Double-DQN. We also investigated the reasons for the superior performance of our method. Interestingly, we found that the performance improvement can be attributed to the large batch size used by the LS method when optimizing the last layer.
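
The periodic least-squares step admits a compact sketch: a ridge-style refit of the last (linear) layer on a large batch, shrinking toward the current weights, in the spirit of the Bayesian regularization described above. This single-head simplification with invented names is a sketch, not the paper's full procedure.

    import numpy as np

    def ls_retrain_last_layer(features, targets, w_prior, lam=1.0):
        """Regularized least-squares refit of a network's last linear layer,
        shrinking toward the current weights w_prior. Solves
        min_w ||features @ w - targets||^2 + lam * ||w - w_prior||^2.
        features: (N, d) last-hidden-layer activations; targets: (N,)
        Q-learning regression targets for the chosen actions."""
        d = features.shape[1]
        A = features.T @ features + lam * np.eye(d)
        b = features.T @ targets + lam * w_prior
        return np.linalg.solve(A, b)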

EWRL Workshop 2016 Workshop Paper

A Deep Hierarchical Approach to Lifelong Learning in Minecraft

  • Chen Tessler
  • Shahar Givony
  • Daniel J. Mankowitz
  • Tom Zahavy
  • Shie Mannor

The ability to reuse or transfer knowledge from one task to another in lifelong learning problems, such as Minecraft, is one of the major challenges faced in AI. Reusing knowledge across tasks is crucial to solving tasks efficiently with lower sample complexity. We provide a Reinforcement Learning agent with the ability to transfer knowledge by learning reusable skills, a type of temporally extended action (also known as Options (Sutton et al., 1999)). The agent learns reusable skills to solve tasks in Minecraft, a popular video game which is an unsolved and high-dimensional lifelong learning problem. These reusable skills, which we refer to as Deep Skill Networks (DSNs), are then incorporated into our novel Hierarchical Deep Reinforcement Learning Network (H-DRLN) architecture. The H-DRLN, a hierarchical extension of Deep Q-Networks, learns to efficiently solve tasks by reusing knowledge from previously learned DSNs. The DSNs are incorporated into the H-DRLN using two techniques: (1) a DSN array and (2) skill distillation, our novel variation of policy distillation (Rusu et al., 2015) for learning skills. Skill distillation enables the H-DRLN to scale in lifelong learning, by accumulating knowledge and encapsulating multiple reusable skills into a single distilled network. The H-DRLN exhibits superior performance and lower learning sample complexity (by taking advantage of temporally extended actions) compared to the regular Deep Q Network (Mnih et al., 2015) in sub-domains of Minecraft. We also show the potential to transfer knowledge between related Minecraft tasks without any additional learning.

NeurIPS Conference 2016 Conference Paper

Adaptive Skills Adaptive Partitions (ASAP)

  • Daniel Mankowitz
  • Timothy Mann
  • Shie Mannor

We introduce the Adaptive Skills, Adaptive Partitions (ASAP) framework that (1) learns skills (i.e., temporally extended actions or options) as well as (2) where to apply them. We believe that both (1) and (2) are necessary for a truly general skill learning framework, which is a key building block needed to scale up to lifelong learning agents. The ASAP framework is also able to solve related new tasks simply by adapting where it applies its existing learned skills. We prove that ASAP converges to a local optimum under natural conditions. Finally, our experimental results, which include a RoboCup domain, demonstrate the ability of ASAP to learn where to reuse skills as well as solve multiple tasks with considerably less experience than solving each task from scratch.

EWRL Workshop 2016 Workshop Paper

Consistent On-Line Off-Policy Evaluation

  • Assaf Hallak
  • Shie Mannor

The problem of on-line off-policy evaluation (OPE) has been actively studied in the last decade due to its importance both as a stand-alone problem and as a sub-module in a policy improvement scheme. However, most Temporal Difference (TD) based solutions ignore the discrepancy between the stationary distribution of the behavior and target policies and its effect on the convergence limit when linear function approximation is applied. In this paper we propose Consistent Off-Policy TD (COP-TD) that addresses this issue directly and enables reducing this bias at some computational expense. We show that COP-TD(λ, β) can be designed to converge to the same value that would have been obtained by using on-policy TD(λ) with the target policy. Subsequently, the proposed scheme leads to a related and promising heuristic we call log-COP-TD(λ, β). Both algorithms show favorable empirical results compared to current state-of-the-art on-line OPE algorithms.

AAAI Conference 2016 Conference Paper

Generalized Emphatic Temporal Difference Learning: Bias-Variance Analysis

  • Assaf Hallak
  • Aviv Tamar
  • Remi Munos
  • Shie Mannor

We consider the off-policy evaluation problem in Markov decision processes with function approximation. We propose a generalization of the recently introduced emphatic temporal differences (ETD) algorithm (Sutton, Mahmood, and White 2015), which encompasses the original ETD(λ), as well as several other off-policy evaluation algorithms as special cases. We call this framework ETD(λ, β), where our introduced parameter β controls the decay rate of an importance-sampling term. We study conditions under which the projected fixed-point equation underlying ETD(λ, β) involves a contraction operator, allowing us to present the first asymptotic error bounds (bias) for ETD(λ, β). Our results show that the original ETD algorithm always involves a contraction operator, and its bias is bounded. Moreover, by controlling β, our proposed generalization allows trading-off bias for variance reduction, thereby achieving a lower total error.
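
The following is a hedged sketch of one update, obtained by taking the standard emphatic-TD recursions and letting the introduced parameter beta, rather than gamma, drive the follow-on trace decay, as the abstract describes. The interest weighting and exact form are simplified and should be checked against the paper.

    import numpy as np

    def etd_beta_step(theta, e, F, phi, phi_next, r, rho, rho_prev,
                      gamma=0.99, lam=0.9, beta=0.9, alpha=0.01, interest=1.0):
        """One emphatic-TD-style update where beta (rather than gamma)
        controls the decay of the follow-on trace F; assembled from the
        standard ETD recursions as a sketch of ETD(lambda, beta)."""
        F = beta * rho_prev * F + interest        # follow-on trace
        M = lam * interest + (1.0 - lam) * F      # emphasis weighting
        e = rho * (gamma * lam * e + M * phi)     # emphatic eligibility trace
        delta = r + gamma * (phi_next @ theta) - phi @ theta
        return theta + alpha * delta * e, e, F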

ICML Conference 2016 Conference Paper

Graying the black box: Understanding DQNs

  • Tom Zahavy
  • Nir Ben-Zrihem
  • Shie Mannor

In recent years there is a growing interest in using deep representations for reinforcement learning. In this paper, we present a methodology and tools to analyze Deep Q-networks (DQNs) in a non-blind manner. Using our tools, we reveal that the features learned by DQNs aggregate the state space in a hierarchical fashion, explaining their success. Moreover, we are able to understand and describe the policies learned by DQNs for three different Atari 2600 games and suggest ways to interpret, debug and optimize deep neural networks in Reinforcement Learning.

ICML Conference 2016 Conference Paper

Heteroscedastic Sequences: Beyond Gaussianity

  • Oren Anava
  • Shie Mannor

We address the problem of sequential prediction in the heteroscedastic setting, when both the signal and its variance are assumed to depend on explanatory variables. By applying regret minimization techniques, we devise an efficient online learning algorithm for the problem, without assuming that the error terms comply with a specific distribution. We show that our algorithm can be adjusted to provide confidence bounds for its predictions, and provide an application to ARCH models. The theoretical results are corroborated by an empirical study.

ICML Conference 2016 Conference Paper

Hierarchical Decision Making In Electricity Grid Management

  • Gal Dalal
  • Elad Gilboa
  • Shie Mannor

The power grid is a complex and vital system that necessitates careful reliability management. Managing the grid is a difficult problem with multiple time scales of decision making and stochastic behavior due to renewable energy generation, variable demand and unplanned outages. Solving this problem in the face of uncertainty requires a new methodology with tractable algorithms. In this work, we introduce a new model for hierarchical decision making in complex systems. We apply reinforcement learning (RL) methods to learn a proxy, i.e., a level of abstraction, for real-time power grid reliability. We devise an algorithm that alternates between slow time-scale policy improvement, and fast time-scale value function approximation. We compare our results to prevailing heuristics, and show the strength of our method.

EWRL Workshop 2016 Workshop Paper

Iterative Hierarchical Optimization for Misspecified Problems

  • Daniel J. Mankowitz
  • Timothy Mann
  • Shie Mannor

For complex, high-dimensional Markov Decision Processes (MDPs), it may be necessary to represent the policy with function approximation. A problem is misspecified whenever the representation cannot express any policy with acceptable performance. We introduce IHOMP: an approach for solving misspecified problems. IHOMP iteratively learns a set of context specialized options and combines these options to solve an otherwise misspecified problem. Our main contribution is proving that IHOMP enjoys theoretical convergence guarantees. In addition, we extend IHOMP to exploit Option Interruption (OI) enabling it to decide where the learned options can be reused. Our experiments demonstrate that IHOMP can find near-optimal solutions to otherwise misspecified problems and that OI can further improve the solutions.

JMLR Journal 2016 Journal Article

Learning the Variance of the Reward-To-Go

  • Aviv Tamar
  • Dotan Di Castro
  • Shie Mannor

In Markov decision processes (MDPs), the variance of the reward-to-go is a natural measure of uncertainty about the long term performance of a policy, and is important in domains such as finance, resource allocation, and process control. Currently however, there is no tractable procedure for calculating it in large scale MDPs. This is in contrast to the case of the expected reward-to-go, also known as the value function, for which effective simulation-based algorithms are known, and have been used successfully in various domains. In this paper we extend temporal difference (TD) learning algorithms to estimating the variance of the reward-to-go for a fixed policy. We propose variants of both TD(0) and LSTD($\lambda$) with linear function approximation, prove their convergence, and demonstrate their utility in an option pricing problem. Our results show a dramatic improvement in terms of sample efficiency over standard Monte-Carlo methods, which are currently the state-of-the-art.
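
The core construction admits a short tabular sketch: run TD(0)-style updates for both the value J and the second moment M of the reward-to-go, and read the variance off as M - J^2. This toy tabular version (our notation) only illustrates the idea; the paper works with linear function approximation and also gives an LSTD($\lambda$) variant.

    import numpy as np

    def td0_second_moment(transitions, n_states, gamma=0.9, alpha=0.05):
        """Tabular TD(0)-style updates for the value J and the second moment
        M of the reward-to-go; the variance estimate is M - J**2.
        transitions: iterable of (s, r, s_next) tuples from a fixed policy."""
        J = np.zeros(n_states)
        M = np.zeros(n_states)
        for s, r, s2 in transitions:
            J[s] += alpha * (r + gamma * J[s2] - J[s])
            # second-moment target: E[(r + gamma * G')^2]
            M[s] += alpha * (r**2 + 2 * gamma * r * J[s2]
                             + gamma**2 * M[s2] - M[s])
        return J, M - J**2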

EWRL Workshop 2016 Workshop Paper

Online Linear Programming with Unobserved Constraints

  • Wenzhuo Yang
  • Shie Mannor
  • Huan Xu

We consider online linear programming with unobserved constraints (LPUC) – a generalization of stochastic linear optimization – where in each round a learner chooses a solution and subsequently receives some feedback about the feasibility of the selected solution w.r.t. the unknown constraints, e.g., indicating which constraint is violated or how much the solution deviates from the feasibility set. To tackle this problem, we develop two algorithms, namely, LPUC-ED based on the epsilon-decreasing strategy and LPUC-UCB based on the upper confidence bound strategy, and derive finite time bounds on the regret and the constraint violation.

JMLR Journal 2016 Journal Article

Regularized Policy Iteration with Nonparametric Function Spaces

  • Amir-massoud Farahmand
  • Mohammad Ghavamzadeh
  • Csaba Szepesvári
  • Shie Mannor

We study two regularization-based approximate policy iteration algorithms, namely REG-LSPI and REG-BRM, to solve reinforcement learning and planning problems in discounted Markov Decision Processes with large state and finite action spaces. The core of these algorithms are the regularized extensions of the Least-Squares Temporal Difference (LSTD) learning and Bellman Residual Minimization (BRM), which are used in the algorithms' policy evaluation steps. Regularization provides a convenient way to control the complexity of the function space to which the estimated value function belongs and as a result enables us to work with rich nonparametric function spaces. We derive efficient implementations of our methods when the function space is a reproducing kernel Hilbert space. We analyze the statistical properties of REG-LSPI and provide an upper bound on the policy evaluation error and the performance loss of the policy returned by this method. Our bound shows the dependence of the loss on the number of samples, the capacity of the function space, and some intrinsic properties of the underlying Markov Decision Process. The dependence of the policy evaluation bound on the number of samples is minimax optimal. This is the first work that provides such a strong guarantee for a nonparametric approximate policy iteration algorithm. (This work is an extension of the NIPS 2008 conference paper by Farahmand et al. (2009b).)

EWRL Workshop 2016 Workshop Paper

Robust Kalman Temporal Difference

  • Shirli Di-Castro Shashua
  • Shie Mannor

We propose an on-line algorithm for policy evaluation in large scale Robust Markov Decision Processes (RMDPs) with uncertainty in the transition probabilities. Our approach is based on the Kalman-Temporal Difference (KTD) formulation, supporting linear and non-linear approximations and considering minimal conditions on the uncertain transition probabilities. Previous work deals with robustness using dynamic programming (DP) and approximate dynamic programming (ADP) methods for both small and large state spaces. These methods can be used only in an off-line setting that requires full trajectories information. In large scale state spaces, the convergence proof is based on a restricted assumption regarding the uncertainty set and only linear value function approximation is considered. Our approach overcomes these limitations by using the Kalman filter framework for on-line estimation and considering the robust Bellman equation as an observation function. We present the Robust-KTD algorithm, analyze its convergence and examine its performance.

EWRL Workshop 2016 Workshop Paper

Situational Awareness by Risk-Conscious Skills

  • Daniel J. Mankowitz
  • Aviv Tamar
  • Shie Mannor

Hierarchical Reinforcement Learning has been previously shown to speed up the convergence rate of RL planning algorithms as well as mitigate feature-based model misspecification (Mankowitz et al., 2016a,b; Bacon and Precup, 2015). To do so, it utilizes hierarchical abstractions, also known as skills, a type of temporally extended action (Sutton et al., 1999), to plan at a higher level, abstracting away from the lower-level details. We incorporate risk sensitivity, also referred to as Situational Awareness (SA), into hierarchical RL for the first time by defining and learning risk aware skills in a Probabilistic Goal Semi-Markov Decision Process (PG-SMDP). This is achieved using our novel Situational Awareness by Risk-Conscious Skills (SARiCoS) algorithm which comes with a theoretical convergence guarantee. We show in a RoboCup soccer domain that the learned risk aware skills exhibit complex human behaviors such as ‘time-wasting’ in a soccer game. In addition, the learned risk aware skills are able to mitigate reward-based model misspecification.

EWRL Workshop 2016 Workshop Paper

Spatio-Temporal Abstractions in Reinforcement Learning Through Neural Encoding

  • Nir Baram
  • Tom Zahavy
  • Shie Mannor

Recent progress in the field of Reinforcement Learning (RL) has made it possible to tackle bigger and more challenging tasks. However, the increasing complexity of the problems, as well as the use of more sophisticated models such as Deep Neural Networks (DNN), impedes the understanding of artificial agents' behavior. In this work, we present the Semi-Aggregated Markov Decision Process (SAMDP) model. The purpose of SAMDP modeling is to describe and allow a better understanding of complex behaviors by identifying temporal and spatial abstractions. In contrast to other modeling approaches, SAMDP is built in a transformed state space that encodes the dynamics of the problem. We show that working with the right state representation mitigates the problem of finding spatial and temporal abstractions. We describe the process of building the SAMDP model from observed trajectories, and give examples of using it on a toy problem and on complicated DQN policies. Finally, we show how the SAMDP can be used to monitor the policy at hand and make it more robust.

RLDM Conference 2015 Conference Abstract

Actively Learning to Attract Followers on Twitter

  • Nir Levine
  • Shie Mannor
  • Timothy Mann

Twitter, a popular social network, presents great opportunities for on-line machine learning research. However, previous research has focused almost entirely on learning from passively collected data. We study the problem of learning to acquire followers through normative user behavior, as opposed to the mass following policies applied by many bots. We formalize the problem as a contextual bandit problem, in which we consider retweeting content to be the action chosen and each tweet (content) is accompanied by context. We design reward signals based on the change in followers. The result of our month-long experiment with 60 agents suggests that (1) aggregating experience across agents can adversely impact prediction accuracy and (2) the Twitter community's response to different actions is non-stationary. Our findings suggest that actively learning on-line can provide deeper insights about how to attract followers than machine learning over passively collected data alone.

JAIR Journal 2015 Journal Article

Approximate Value Iteration with Temporally Extended Actions

  • Timothy A. Mann
  • Shie Mannor
  • Doina Precup

Temporally extended actions have proven useful for reinforcement learning, but their duration also makes them valuable for efficient planning. The options framework provides a concrete way to implement and reason about temporally extended actions. Existing literature has demonstrated the value of planning with options empirically, but there is a lack of theoretical analysis formalizing when planning with options is more efficient than planning with primitive actions. We provide a general analysis of the convergence rate of a popular Approximate Value Iteration (AVI) algorithm called Fitted Value Iteration (FVI) with options. Our analysis reveals that longer duration options and a pessimistic estimate of the value function both lead to faster convergence. Furthermore, options can improve convergence even when they are suboptimal and sparsely distributed throughout the state-space. Next we consider the problem of generating useful options for planning based on a subset of landmark states. This suggests a new algorithm, Landmark-based AVI (LAVI), that represents the value function only at the landmark states. We analyze both FVI and LAVI using the proposed landmark-based options and compare the two algorithms. Our experimental results in three different domains demonstrate the key properties from the analysis. Our theoretical and experimental results demonstrate that options can play an important role in AVI by decreasing approximation error and inducing fast convergence.

NeurIPS Conference 2015 Conference Paper

Community Detection via Measure Space Embedding

  • Mark Kozdoba
  • Shie Mannor

We present a new algorithm for community detection. The algorithm uses random walks to embed the graph in a space of measures, after which a modification of $k$-means in that space is applied. The algorithm is therefore fast and easily parallelizable. We evaluate the algorithm on standard random graph benchmarks, including some overlapping community benchmarks, and find its performance to be better or at least as good as previously known algorithms. We also prove a linear time (in number of edges) guarantee for the algorithm on a $p, q$-stochastic block model where $p \geq c \cdot N^{-1/2 + \epsilon}$ and $p - q \geq c' \sqrt{p N^{-1/2 + \epsilon} \log N}$.
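
A rough sketch of the embedding step follows: represent each node by its short-random-walk distribution and cluster those vectors. Off-the-shelf k-means stands in here for the paper's modified k-means over measures, and dense inputs without isolated nodes are assumed.

    import numpy as np
    from sklearn.cluster import KMeans

    def random_walk_embed_communities(A, k, walk_len=3):
        """Embed each node by its walk_len-step random-walk distribution and
        cluster the embeddings. A: dense (n, n) adjacency matrix with no
        isolated nodes; k: number of communities to recover."""
        P = A / A.sum(axis=1, keepdims=True)       # row-stochastic transitions
        emb = np.linalg.matrix_power(P, walk_len)  # row i = walk law from node i
        return KMeans(n_clusters=k, n_init=10).fit_predict(emb)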

EWRL Workshop 2015 Workshop Paper

Contextual Markov Decision Processes

  • Assaf Hallak
  • Dotan Di Castro
  • Shie Mannor

We consider a planning problem where the dynamics and rewards of the environment depend on a hidden static parameter referred to as the context. The objective is to learn a strategy that maximizes the accumulated reward across all contexts. The new model, called Contextual Markov Decision Process (CMDP), can model a customer’s behavior when interacting with a website. The customer’s behavior depends on gender, age, location, device, etc. Based on that behavior, the website's objective is to determine customer characteristics, and to optimize the interaction between them. Our work focuses on one basic scenario: finite horizon with a small number of possible contexts. We suggest a family of algorithms with provable guarantees that learn the underlying models and the latent contexts, and optimize the CMDPs. Bounds are obtained for specific naive implementations, and extensions of the framework are discussed, laying the ground for future research.

ICML Conference 2015 Conference Paper

Dynamic Sensing: Better Classification under Acquisition Constraints

  • Oran Richman
  • Shie Mannor

In many machine learning applications the quality of the data is limited by resource constraints (be it power, bandwidth, memory, ...). In such cases, the constraints are on the average resources allocated, so there is some control over each sample’s quality. In most cases this option remains unused and the data’s quality is uniform over the samples. In this paper we propose to actively allocate resources to each sample such that resources are used optimally overall. We propose a method to compute the optimal resource allocation. We further derive generalization bounds for the case where the problem’s model is unknown. We demonstrate the potential benefit of this approach on both simulated and real-life problems.

EWRL Workshop 2015 Workshop Paper

Learning to coordinate without communication in multi-user multi-armed bandit problems

  • Orly Avner
  • Shie Mannor

We consider a setting where multiple users share multiple channels modeled as a multi-user multi-armed bandit (MAB) problem. The characteristics of each channel are initially unknown and may differ between the users. Each user can choose between the channels, but her success depends on the particular channel as well as on the selections of other users: if two users select the same channel their messages collide and none of them manages to send any data. Our setting is fully distributed, so there is no central control and every user only observes the channel she currently uses. As in many communication systems such as cognitive radio networks, the users cannot communicate among themselves so coordination must be achieved without direct communication. We develop algorithms for learning a stable configuration for the multiple user MAB problem. We further offer both convergence guarantees and experiments inspired by real communication networks.

EWRL Workshop 2015 Workshop Paper

Off-policy Model-based Learning under Unknown Factored Dynamics

  • Assaf Hallak
  • Francois Schnitzler
  • Timothy Mann
  • Shie Mannor

Off-policy learning in dynamic decision problems is essential for providing strong evidence that a new policy is better than the one in use. But how can we prove superiority without testing the new policy? To answer this question, we introduce the G-SCOPE algorithm that evaluates a new policy based on data generated by the existing policy. Our algorithm is both computationally and sample efficient because it greedily learns to exploit factored structure in the dynamics of the environment. We present a finite sample analysis of our approach and show through experiments that the algorithm scales well on high-dimensional problems with few samples.

ICML Conference 2015 Conference Paper

Off-policy Model-based Learning under Unknown Factored Dynamics

  • Assaf Hallak
  • François Schnitzler
  • Timothy A. Mann
  • Shie Mannor

Off-policy learning in dynamic decision problems is essential for providing strong evidence that a new policy is better than the one in use. But how can we prove superiority without testing the new policy? To answer this question, we introduce the G-SCOPE algorithm that evaluates a new policy based on data generated by the existing policy. Our algorithm is both computationally and sample efficient because it greedily learns to exploit factored structure in the dynamics of the environment. We present a finite sample analysis of our approach and show through experiments that the algorithm scales well on high-dimensional problems with few samples.

NeurIPS Conference 2015 Conference Paper

Online Learning for Adversaries with Memory: Price of Past Mistakes

  • Oren Anava
  • Elad Hazan
  • Shie Mannor

The framework of online learning with memory naturally captures learning problems with temporal effects, and was previously studied for the experts setting. In this work we extend the notion of learning with memory to the general Online Convex Optimization (OCO) framework, and present two algorithms that attain low regret. The first algorithm applies to Lipschitz continuous loss functions, obtaining optimal regret bounds for both convex and strongly convex losses. The second algorithm attains the optimal regret bounds and applies more broadly to convex losses without requiring Lipschitz continuity, yet is more complicated to implement. We complement the theoretic results with two applications: statistical arbitrage in finance, and multi-step ahead prediction in statistics.

AAAI Conference 2015 Conference Paper

Optimizing the CVaR via Sampling

  • Aviv Tamar
  • Yonatan Glassner
  • Shie Mannor

Conditional Value at Risk (CVaR) is a prominent risk measure that is being used extensively in various domains. We develop a new formula for the gradient of the CVaR in the form of a conditional expectation. Based on this formula, we propose a novel sampling-based estimator for the gradient of the CVaR, in the spirit of the likelihood-ratio method. We analyze the bias of the estimator, and prove the convergence of a corresponding stochastic gradient descent algorithm to a local CVaR optimum. Our method allows us to consider CVaR optimization in new domains. As an example, we consider a reinforcement learning application, and learn a risk-sensitive controller for the game of Tetris.
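
The gradient formula lends itself to a compact sampling sketch: estimate the VaR from trajectory costs, then average score vectors weighted by the excess cost over the worst alpha-fraction. The array names are ours, and this is a hedged reading of a likelihood-ratio-style estimator, not the paper's exact algorithm.

    import numpy as np

    def cvar_policy_gradient(costs, grad_logps, alpha=0.05):
        """Sampling-based CVaR gradient estimate: average score vectors
        weighted by (cost - VaR) over the worst alpha-fraction of samples.
        costs: (N,) trajectory costs; grad_logps: (N, d) score vectors,
        i.e., gradients of the log trajectory probability."""
        var_hat = np.quantile(costs, 1 - alpha)   # empirical VaR of the cost
        tail = costs >= var_hat                   # worst alpha-fraction
        excess = costs[tail] - var_hat
        return (excess[:, None] * grad_logps[tail]).sum(axis=0) / (
            len(costs) * alpha)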

NeurIPS Conference 2015 Conference Paper

Policy Gradient for Coherent Risk Measures

  • Aviv Tamar
  • Yinlam Chow
  • Mohammad Ghavamzadeh
  • Shie Mannor

Several authors have recently developed risk-sensitive policy gradient methods that augment the standard expected cost minimization problem with a measure of variability in cost. These studies have focused on specific risk-measures, such as the variance or conditional value at risk (CVaR). In this work, we extend the policy gradient method to the whole class of coherent risk measures, which is widely accepted in finance and operations research, among other fields. We consider both static and time-consistent dynamic risk measures. For static risk measures, our approach is in the spirit of policy gradient algorithms and combines a standard sampling approach with convex programming. For dynamic risk measures, our approach is actor-critic style and involves explicit approximation of the value function. Most importantly, our contribution presents a unified approach to risk-sensitive reinforcement learning that generalizes and extends previous results.

NeurIPS Conference 2015 Conference Paper

Risk-Sensitive and Robust Decision-Making: a CVaR Optimization Approach

  • Yinlam Chow
  • Aviv Tamar
  • Shie Mannor
  • Marco Pavone

In this paper we address the problem of decision making within a Markov decision process (MDP) framework where risk and modeling errors are taken into account. Our approach is to minimize a risk-sensitive conditional-value-at-risk (CVaR) objective, as opposed to a standard risk-neutral expectation. We refer to such a problem as a CVaR MDP. Our first contribution is to show that a CVaR objective, besides capturing risk sensitivity, has an alternative interpretation as expected cost under worst-case modeling errors, for a given error budget. This result, which is of independent interest, motivates CVaR MDPs as a unifying framework for risk-sensitive and robust decision making. Our second contribution is to present a value-iteration algorithm for CVaR MDPs, and analyze its convergence rate. To our knowledge, this is the first solution algorithm for CVaR MDPs that enjoys error guarantees. Finally, we present results from numerical experiments that corroborate our theoretical findings and show the practicality of our approach.

ICML Conference 2014 Conference Paper

Concept Drift Detection Through Resampling

  • Maayan Harel
  • Shie Mannor
  • Ran El-Yaniv
  • Koby Crammer

Detecting changes in data-streams is an important part of enhancing learning quality in dynamic environments. We devise a procedure for detecting concept drifts in data-streams that relies on analyzing the empirical loss of learning algorithms. Our method is based on obtaining statistics from the loss distribution by reusing the data multiple times via resampling. We present theoretical guarantees for the proposed procedure based on the stability of the underlying learning algorithms. Experimental results show that the detection method has high recall and precision, and performs well in the presence of noise.

NeurIPS Conference 2014 Conference Paper

"How hard is my MDP?" The distribution-norm to the rescue

  • Odalric-Ambrym Maillard
  • Timothy Mann
  • Shie Mannor

In Reinforcement Learning (RL), state-of-the-art algorithms require a large number of samples per state-action pair to estimate the transition kernel $p$. In many problems, a good approximation of $p$ is not needed. For instance, if from one state-action pair $(s, a)$, one can only transit to states with the same value, learning $p(\cdot|s, a)$ accurately is irrelevant (only its support matters). This paper aims at capturing such behavior by defining a novel hardness measure for Markov Decision Processes (MDPs) we call the \emph{distribution-norm}. The distribution-norm w.r.t. a measure $\nu$ is defined on zero $\nu$-mean functions $f$ by the standard variation of $f$ with respect to $\nu$. We first provide a concentration inequality for the dual of the distribution-norm. This allows us to replace the generic but loose $||\cdot||_1$ concentration inequalities used in most previous analyses of RL algorithms, to benefit from this new hardness measure. We then show that several common RL benchmarks have low hardness when measured using the new norm. The distribution-norm captures finer properties than the number of states or the diameter and can be used to assess the difficulty of MDPs.

ICML Conference 2014 Conference Paper

Latent Bandits

  • Odalric-Ambrym Maillard
  • Shie Mannor

We consider a multi-armed bandit problem where the reward distributions are indexed by two sets – one for arms, one for type – and can be partitioned into a small number of clusters according to the type. First, we consider the setting where all reward distributions are known and all types have the same underlying cluster, the type’s identity is, however, unknown. Second, we study the case where types may come from different classes, which is significantly more challenging. Finally, we tackle the case where the reward distributions are completely unknown. In each setting, we introduce specific algorithms and derive non-trivial regret performance. Numerical experiments show that, in the most challenging agnostic case, the proposed algorithm achieves excellent performance in several difficult scenarios.

NeurIPS Conference 2014 Conference Paper

Robust Logistic Regression and Classification

  • Jiashi Feng
  • Huan Xu
  • Shie Mannor
  • Shuicheng Yan

We consider logistic regression with arbitrary outliers in the covariate matrix. We propose a new robust logistic regression algorithm, called RoLR, that estimates the parameter through a simple linear programming procedure. We prove that RoLR is robust to a constant fraction of adversarial outliers. To the best of our knowledge, this is the first result on estimating logistic regression model when the covariate matrix is corrupted with any performance guarantees. Besides regression, we apply RoLR to solving binary classification problems where a fraction of training samples are corrupted.

ICML Conference 2014 Conference Paper

Scaling Up Approximate Value Iteration with Options: Better Policies with Fewer Iterations

  • Timothy A. Mann
  • Shie Mannor

We show how options, a class of control structures encompassing primitive and temporally extended actions, can play a valuable role in planning in MDPs with continuous state-spaces. Analyzing the convergence rate of Approximate Value Iteration with options reveals that for pessimistic initial value function estimates, options can speed up convergence compared to planning with only primitive actions even when the temporally extended actions are suboptimal and sparsely scattered throughout the state-space. Our experimental results in an optimal replacement task and a complex inventory management task demonstrate the potential for options to speed up convergence in practice. We show that options induce faster convergence to the optimal value function, which implies deriving better policies with fewer iterations.

ICML Conference 2014 Conference Paper

Scaling Up Robust MDPs using Function Approximation

  • Aviv Tamar
  • Shie Mannor
  • Huan Xu 0001

We consider large-scale Markov decision processes (MDPs) with parameter uncertainty, under the robust MDP paradigm. Previous studies showed that robust MDPs, based on a minimax approach to handling uncertainty, can be solved using dynamic programming for small to medium sized problems. However, due to the "curse of dimensionality", MDPs that model real-life problems are typically prohibitively large for such approaches. In this work we employ a reinforcement learning approach to tackle this planning problem: we develop a robust approximate dynamic programming method based on a projected fixed point equation to approximately solve large scale robust MDPs. We show that the proposed method provably succeeds under certain technical conditions, and demonstrate its effectiveness through simulation of an option pricing problem. To the best of our knowledge, this is the first attempt to scale up the robust MDP paradigm.

JMLR Journal 2014 Journal Article

Set-Valued Approachability and Online Learning with Partial Monitoring

  • Shie Mannor
  • Vianney Perchet
  • Gilles Stoltz

Approachability has become a standard tool in analyzing learning algorithms in the adversarial online learning setup. We develop a variant of approachability for games where there is ambiguity in the obtained reward: it belongs to a set rather than being a single vector. Using this variant we tackle the problem of approachability in games with partial monitoring and develop a simple and generally efficient strategy (i.e., with constant per-step complexity) for this setup. As an important example, we instantiate our general strategy to the case when external regret or internal regret is to be minimized under partial monitoring.

ICML Conference 2014 Conference Paper

Thompson Sampling for Complex Online Problems

  • Aditya Gopalan
  • Shie Mannor
  • Yishay Mansour

We consider stochastic multi-armed bandit problems with complex actions over a set of basic arms, where the decision maker plays a complex action rather than a basic arm in each round. The reward of the complex action is some function of the basic arms’ rewards, and the feedback observed may not necessarily be the reward per-arm. For instance, when the complex actions are subsets of the arms, we may only observe the maximum reward over the chosen subset. Thus, feedback across complex actions may be coupled due to the nature of the reward function. We prove a frequentist regret bound for Thompson sampling in a very general setting involving parameter, action and observation spaces and a likelihood function over them. The bound holds for discretely-supported priors over the parameter space and without additional structural properties such as closed-form posteriors, conjugate prior structure or independence across arms. The regret bound scales logarithmically with time but, more importantly, with an improved constant that non-trivially captures the coupling across complex actions due to the structure of the rewards. As applications, we derive improved regret bounds for classes of complex bandit problems involving selecting subsets of arms, including the first nontrivial regret bounds for nonlinear MAX reward feedback from subsets. Using particle filters for computing posterior distributions which lack an explicit closed-form, we present numerical results for the performance of Thompson sampling for subset-selection and job scheduling problems.
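
A minimal sketch of the particle-filter machinery mentioned above follows: play the best complex action under one posterior sample, then reweight particles by the likelihood of the observed (possibly coupled) feedback. The callables `mean_reward` and `likelihood` are assumed model components, and resampling is omitted for brevity.

    import numpy as np

    def thompson_action(particles, weights, actions, mean_reward, rng):
        """Sample one parameter particle from the posterior weights and play
        the complex action that maximizes its model-predicted mean reward."""
        theta = particles[rng.choice(len(particles), p=weights)]
        return max(actions, key=lambda a: mean_reward(a, theta))

    def particle_posterior_update(particles, weights, action, feedback,
                                  likelihood):
        """Reweight each particle by the likelihood of the observed feedback
        under the played action, then renormalize; reset to uniform if all
        weights vanish. `likelihood(feedback, action, particle)` is an
        assumed model callable."""
        w = weights * np.array([likelihood(feedback, action, p)
                                for p in particles])
        total = w.sum()
        if total == 0.0:
            return np.full(len(weights), 1.0 / len(weights))
        return w / total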

ICML Conference 2014 Conference Paper

Time-Regularized Interrupting Options (TRIO)

  • Timothy A. Mann
  • Daniel J. Mankowitz
  • Shie Mannor

High-level skills relieve planning algorithms from low-level details. But when the skills are poorly designed for the domain, the resulting plan may be severely suboptimal. Sutton et al. (1999) made an important step towards resolving this problem by introducing a rule that automatically improves a set of skills called options. This rule terminates an option early whenever switching to another option gives a higher value than continuing with the current option. However, they only analyzed the case where the improvement rule is applied once. We show conditions where this rule converges to the optimal set of options. A new Bellman-like operator that simultaneously improves the set of options is at the core of our analysis. One problem with the update rule is that it tends to favor lower-level skills. Therefore we introduce a regularization term that favors longer duration skills. Experimental results demonstrate that this approach can derive a good set of high-level skills even when the original set of skills cannot solve the problem.

RLDM Conference 2013 Conference Abstract

Complex Bandit Problems and Thompson Sampling

  • Aditya Gopalan
  • Shie Mannor
  • Yishay Mansour

We study stochastic multi-armed bandit settings with complex actions derived from the basic bandit arms, e.g., subsets or partitions of basic arms. The decision maker is faced with selecting at each round a complex action instead of a basic arm. We allow the reward of the complex action to be some function of the basic arms’ rewards, and so the feedback observed may not necessarily be the reward per-arm. For instance, when the complex actions are subsets of bandit arms, we may only observe the maximum reward over the chosen subset. Feedback from playing (complex) actions can thus be indicative of rewards from other actions, and leveraging this coupled feedback becomes important to the decision maker in order to learn efficiently. We propose applying Thompson Sampling – a Bayesian-inspired algorithm for the standard multi-armed bandit – for minimizing regret in complex bandit problems. We derive the first general, frequentist regret bound for Thompson sampling in complex bandit settings, that holds without specific structural assumptions on the prior used by the algorithm. The regret bound exhibits the standard logarithmic scaling with time but with a non-trivial multiplicative constant that encodes the coupled information structure of the complex bandit. As applications, we show improved regret bounds (compared to treating the complex actions as independent) for a class of complex, subset-selection bandit problems. Using particle filters for computing posterior distributions that often lack an explicit closed-form, we apply Thompson-sampling algorithms for subset selection and job-scheduling problems and present numerical results.

NeurIPS Conference 2013 Conference Paper

Learning Multiple Models via Regularized Weighting

  • Daniel Vainsencher
  • Shie Mannor
  • Huan Xu

We consider the general problem of Multiple Model Learning (MML) from data, from the statistical and algorithmic perspectives; this problem includes clustering, multiple regression and subspace clustering as special cases. A common approach to solving new MML problems is to generalize Lloyd's algorithm for clustering (or Expectation-Maximization for soft clustering). However this approach is unfortunately sensitive to outliers and large noise: a single exceptional point may take over one of the models. We propose a different general formulation that seeks for each model a distribution over data points; the weights are regularized to be sufficiently spread out. This enhances robustness by making assumptions on class balance. We further provide generalization bounds and explain how the new iterations may be computed efficiently. We demonstrate the robustness benefits of our approach with some experimental results and prove for the important case of clustering that our approach has a non-trivial breakdown point, i.e., is guaranteed to be robust to a fixed percentage of adversarial unbounded outliers.

NeurIPS Conference 2013 Conference Paper

Online PCA for Contaminated Data

  • Jiashi Feng
  • Huan Xu
  • Shie Mannor
  • Shuicheng Yan

We consider the online Principal Component Analysis (PCA) for contaminated samples (containing outliers) which are revealed sequentially to the Principal Components (PCs) estimator. Due to their sensitivity to outliers, previous online PCA algorithms fail in this case and their results can be arbitrarily bad. Here we propose the online robust PCA algorithm, which is able to improve the PCs estimation upon an initial one steadily, even when faced with a constant fraction of outliers. We show that the final result of the proposed online RPCA has an acceptable degradation from the optimum. Actually, under mild conditions, online RPCA achieves the maximal robustness with a $50\%$ breakdown point. Moreover, online RPCA is shown to be efficient for both storage and computation, since it need not re-explore the previous samples as in traditional robust PCA algorithms. This endows online RPCA with scalability for large scale data.

NeurIPS Conference 2013 Conference Paper

Reinforcement Learning in Robust Markov Decision Processes

  • Shiau Hong Lim
  • Huan Xu
  • Shie Mannor

An important challenge in Markov decision processes is to ensure robustness with respect to unexpected or adversarial system behavior while taking advantage of well-behaving parts of the system. We consider a problem setting where some unknown parts of the state space can have arbitrary transitions while other parts are purely stochastic. We devise an algorithm that is adaptive to potentially adversarial behavior and show that it achieves similar regret bounds as the purely stochastic case.

RLDM Conference 2013 Conference Abstract

Robust Sequential Decision Making

  • Shie Mannor

We consider planning problems where the parameters of the problems are not known. The robust approach to sequential decision making is to assume that the worst possible realization within a predefined uncertainty set will occur at every stage. While this approach is tractable, its pessimistic nature may lead to extremely conservative solutions. We will discuss several approaches that work around the inherent conservativeness of the standard robust approach while remaining tractable. The proposed approaches also offer interesting probabilistic guarantees on the performance of the computed policy under a probabilistic deviation model.

ICML Conference 2013 Conference Paper

Robust Sparse Regression under Adversarial Corruption

  • Yudong Chen 0001
  • Constantine Caramanis
  • Shie Mannor

We consider high dimensional sparse regression with arbitrary – possibly, severe or coordinated – errors in the covariate matrix. We are interested in understanding how many corruptions we can tolerate, while identifying the correct support. To the best of our knowledge, neither standard outlier rejection techniques, nor recently developed robust regression algorithms (that focus only on corrupted response variables), nor recent algorithms for dealing with stochastic noise or erasures, can provide guarantees on support recovery. As we show, neither can the natural brute force algorithm that takes exponential time to find the subset of data and support columns, that yields the smallest regression error. We explore the power of a simple idea: replace the essential linear algebraic calculation – the inner product – with a robust counterpart that cannot be greatly affected by a controlled number of arbitrarily corrupted points: the trimmed inner product. We consider three popular algorithms in the uncorrupted setting: Thresholding Regression, Lasso, and the Dantzig selector, and show that the counterparts obtained using the trimmed inner product are provably robust.
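
The primitive itself is a few lines; the sketch below drops the k largest-magnitude elementwise products before summing, so a bounded number of arbitrarily corrupted coordinates cannot dominate the result. A direct but unofficial rendering of the idea.

    import numpy as np

    def trimmed_inner_product(x, y, k):
        """Inner product with the k largest-magnitude elementwise products
        removed, limiting the influence of up to k corrupted coordinates."""
        prods = x * y
        keep = np.argsort(np.abs(prods))[:len(prods) - k]
        return prods[keep].sum()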

ICML Conference 2013 Conference Paper

Temporal Difference Methods for the Variance of the Reward To Go

  • Aviv Tamar
  • Dotan Di Castro
  • Shie Mannor

In this paper we extend temporal difference policy evaluation algorithms to performance criteria that include the variance of the cumulative reward. Such criteria are useful for risk management, and are important in domains such as finance and process control. We propose variants of both TD(0) and LSTD(λ) with linear function approximation, prove their convergence, and demonstrate their utility in a 4-dimensional continuous state space problem.
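
A tabular caricature of the construction may help: alongside the value $V$, learn the second moment $M$ of the reward-to-go with a TD(0)-style update, and read off the variance as $M - V^2$. The paper works with linear function approximation; the step size and discount below are placeholders.

```python
def td0_variance_update(V, M, s, r, s_next, alpha=0.1, gamma=0.99):
    """One TD(0)-style update of the value V and second moment M of the
    reward-to-go; the variance estimate at s is M[s] - V[s]**2."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
    M[s] += alpha * (r**2 + 2 * gamma * r * V[s_next]
                     + gamma**2 * M[s_next] - M[s])
```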

RLDM Conference 2013 Conference Abstract

The Advantage of Planning with Options

  • Timothy Mann
  • Shie Mannor

Temporally extended actions or options have primarily been applied to speed up reinforcement learning by directing exploration to critical regions of the state space. We show that options may play a critical role in planning as well. To demonstrate this, we analyze the convergence rate of Fitted Value Iteration with options. Our analysis reveals that for pessimistic value function estimates, options can improve the convergence rate compared to Fitted Value Iteration with only primitive actions. Furthermore, options can improve convergence even when they are suboptimal. Our experimental results in two different domains demonstrate the key properties from the analysis. While previous research has primarily considered options as a tool for exploration, our theoretical and experimental results demonstrate that options can play an important role in planning.

NeurIPS Conference 2012 Conference Paper

The Perturbed Variation

  • Maayan Harel
  • Shie Mannor

We introduce a new discrepancy score between two distributions that gives an indication of their \emph{similarity}. While much research has been done to determine if two samples come from exactly the same distribution, much less research has considered the problem of determining if two finite samples come from similar distributions. The new score gives an intuitive interpretation of similarity; it optimally perturbs the distributions so that they best fit each other. The score is defined between distributions, and can be efficiently estimated from samples. We provide convergence bounds for the estimated score, and develop hypothesis testing procedures that test if two data sets come from similar distributions. The statistical power of these procedures is demonstrated in simulations. We also compare the score's capacity to detect similarity with that of other known measures on real data.
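
For one-dimensional samples of equal size, the sample-based score can be sketched as a maximum matching problem: pair up points of the two samples that are within eps of each other and report the unmatched fraction. The eps parameter and the normalization below are our simplifications of the paper's estimator.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def perturbed_variation_estimate(x, y, eps):
    # 0/1 costs: minimizing total assignment cost maximizes the number
    # of pairs matched within eps.
    cost = (np.abs(x[:, None] - y[None, :]) > eps).astype(float)
    rows, cols = linear_sum_assignment(cost)
    unmatched = cost[rows, cols].sum()   # pairs farther apart than eps
    return unmatched / len(x)
```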

NeurIPS Conference 2011 Conference Paper

Committing Bandits

  • Loc Bui
  • Ramesh Johari
  • Shie Mannor

We consider a multi-armed bandit problem with two phases. The first phase is an experimentation phase where the decision maker is free to explore multiple options. In the second phase the decision maker has to commit to one of the arms and stick with it. Cost is incurred during both phases, with a higher cost during the experimentation phase. We analyze the regret in this setup, propose algorithms, and provide upper and lower bounds that depend on the ratio of the duration of the experimentation phase to the duration of the commitment phase. Our analysis reveals that if given the choice, it is optimal to experiment for $\Theta(\ln T)$ steps and then commit, where $T$ is the time horizon.
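
The headline recommendation is an explore-then-commit schedule. A minimal sketch under our own assumptions: arms as reward-sampling callables, and an unspecified constant c in front of ln T.

```python
import numpy as np

def explore_then_commit(arms, T, c=1.0):
    """Pull each arm ~c*ln(T) times, then commit to the empirical best."""
    m = max(1, int(c * np.log(T)))            # Theta(ln T) exploration
    means = np.array([np.mean([arm() for _ in range(m)]) for arm in arms])
    best = int(np.argmax(means))
    committed = [arms[best]() for _ in range(max(0, T - len(arms) * m))]
    return best, committed
```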

NeurIPS Conference 2011 Conference Paper

From Bandits to Experts: On the Value of Side-Observations

  • Shie Mannor
  • Ohad Shamir

We consider an adversarial online learning setting where a decision maker can choose an action in every stage of the game. In addition to observing the reward of the chosen action, the decision maker gets side observations on the reward he would have obtained had he chosen some of the other actions. The observation structure is encoded as a graph, where node i is linked to node j if sampling i provides information on the reward of j. This setting naturally interpolates between the well-known ``experts'' setting, where the decision maker can view all rewards, and the multi-armed bandits setting, where the decision maker can only view the reward of the chosen action. We develop practical algorithms with provable regret guarantees, which depend on non-trivial graph-theoretic properties of the information feedback structure. We also provide partially-matching lower bounds.

IJCAI Conference 2011 Conference Paper

Probabilistic Goal Markov Decision Processes

  • Huan Xu
  • Shie Mannor

The Markov decision process model is a powerful tool in planning tasks and sequential decision making problems. The randomness of state transitions and rewards implies that the performance of a policy is often stochastic. In contrast to the standard approach that studies the expected performance, we consider the policy that maximizes the probability of achieving a pre-determined target performance, a criterion we term *probabilistic goal Markov decision processes*. We show that this problem is NP-hard, but can be solved using a pseudo-polynomial algorithm. We further consider a variant, dubbed "chance-constrained Markov decision problems," that treats the probability of achieving the target performance as a constraint instead of the maximizing objective. This variant is also NP-hard, but can be solved in pseudo-polynomial time.
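
The pseudo-polynomial algorithm can be pictured as backward induction over the state augmented with the reward accumulated so far. Below is a finite-horizon sketch with nonnegative integer rewards; the array layout (P[a][s] a distribution over next states, R[a][s] an integer reward) is our assumption.

```python
import numpy as np

def prob_goal_dp(P, R, horizon, target, n_states, max_total):
    """prob[t, s, g]: best probability of finishing with total reward >=
    target, starting from state s at time t with g already accumulated."""
    prob = np.zeros((horizon + 1, n_states, max_total + 1))
    prob[horizon] = (np.arange(max_total + 1) >= target)[None, :]
    for t in range(horizon - 1, -1, -1):
        for s in range(n_states):
            for g in range(max_total + 1):
                prob[t, s, g] = max(
                    P[a][s] @ prob[t + 1, :, min(max_total, g + R[a][s])]
                    for a in range(len(P)))
    return prob  # table is polynomial in max_total, hence pseudo-polynomial
```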

JMLR Journal 2011 Journal Article

The Sample Complexity of Dictionary Learning

  • Daniel Vainsencher
  • Shie Mannor
  • Alfred M. Bruckstein

A large set of signals can sometimes be described sparsely using a dictionary, that is, every element can be represented as a linear combination of few elements from the dictionary. Algorithms for various signal processing applications, including classification, denoising and signal separation, learn a dictionary from a given set of signals to be represented. Can we expect that the error in representing by such a dictionary a previously unseen signal from the same source will be of similar magnitude as those for the given examples? We assume signals are generated from a fixed distribution, and study these questions from a statistical learning theory perspective. We develop generalization bounds on the quality of the learned dictionary for two types of constraints on the coefficient selection, as measured by the expected $L_2$ error in representation when the dictionary is used. For the case of $\ell_1$ regularized coefficient selection we provide a generalization bound of the order of $O(\sqrt{np\ln(m\lambda)/m})$, where $n$ is the dimension, $p$ is the number of elements in the dictionary, $\lambda$ is a bound on the $\ell_1$ norm of the coefficient vector and $m$ is the number of samples, which complements existing results. For the case of representing a new signal as a combination of at most $k$ dictionary elements, we provide a bound of the order $O(\sqrt{np\ln(mk)/m})$ under an assumption on the closeness to orthogonality of the dictionary (low Babel function). We further show that this assumption holds for most dictionaries in high dimensions in a strong probabilistic sense. Our results also include bounds that converge as $1/m$, not previously known for this problem. We provide similar results in a general setting using kernels with weak smoothness requirements.

AAAI Conference 2010 Conference Paper

Activity and Gait Recognition with Time-Delay Embeddings

  • Jordan Frank
  • Shie Mannor
  • Doina Precup

Activity recognition based on data from mobile wearable devices is becoming an important application area for machine learning. We propose a novel approach based on a combination of feature extraction using time-delay embedding and supervised learning. The computational requirements are considerably lower than existing approaches, so the processing can be done in real time on a low-powered portable device such as a mobile phone. We evaluate the performance of our algorithm on a large, noisy data set comprising over 50 hours of data from six different subjects, including activities such as running and walking up or down stairs. We also demonstrate the ability of the system to accurately classify an individual from a set of 25 people, based only on the characteristics of their walking gait. The system requires very little parameter tuning, and can be trained with small amounts of data.
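
Time-delay embedding itself is a one-liner worth seeing; the embedding dimension m and delay tau below are generic placeholders that would be tuned per application.

```python
import numpy as np

def time_delay_embedding(x, m, tau):
    """Turn a scalar series x into rows [x(t), x(t+tau), ...,
    x(t+(m-1)tau)]; these vectors serve as features for classification."""
    n = len(x) - (m - 1) * tau
    return np.stack([x[i * tau : i * tau + n] for i in range(m)], axis=1)
```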

NeurIPS Conference 2010 Conference Paper

Distributionally Robust Markov Decision Processes

  • Huan Xu
  • Shie Mannor

We consider Markov decision processes where the values of the parameters are uncertain. This uncertainty is described by a sequence of nested sets (that is, each set contains the previous one), each of which corresponds to a probabilistic guarantee for a different confidence level, so that a set of admissible probability distributions of the unknown parameters is specified. This formulation models the case where the decision maker is aware of and wants to exploit some (albeit imprecise) a priori information about the distribution of parameters, and arises naturally in practice where methods to estimate the confidence region of parameters abound. We propose a decision criterion based on distributional robustness: the optimal policy maximizes the expected total reward under the most adversarial admissible probability distribution over realizations of the uncertain parameters (i.e., one that agrees with the a priori information). We show that finding the optimal distributionally robust policy can be reduced to a standard robust MDP where the parameters belong to a single uncertainty set; hence it can be computed in polynomial time under mild technical conditions.
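
The reduction lands in a standard robust Bellman backup: a worst case over an uncertainty set of models. A sketch for the simplest case of a finite candidate set, with Ps a numpy array of shape (n_models, n_actions, n_states, n_states) as our assumed layout:

```python
import numpy as np

def robust_value_iteration(Ps, R, gamma=0.95, iters=200):
    """V(s) = R(s) + gamma * max_a min_k sum_s' Ps[k,a,s,s'] V(s')."""
    V = np.zeros(len(R))
    for _ in range(iters):
        V = np.array([R[s] + gamma * max(
                (Ps[:, a, s] @ V).min()        # adversarial model choice
                for a in range(Ps.shape[1]))
            for s in range(len(R))])
    return V
```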

NeurIPS Conference 2010 Conference Paper

Online Classification with Specificity Constraints

  • Andrey Bernstein
  • Shie Mannor
  • Nahum Shimkin

We consider the online binary classification problem, where we are given m classifiers. At each stage, the classifiers map the input to the probability that the input belongs to the positive class. An online classification meta-algorithm is an algorithm that combines the outputs of the classifiers in order to attain a certain goal, without having prior knowledge on the form and statistics of the input, and without prior knowledge on the performance of the given classifiers. In this paper, we use sensitivity and specificity as the performance metrics of the meta-algorithm. In particular, our goal is to design an algorithm which satisfies the following two properties (asymptotically): (i) its average false positive rate (fp-rate) is under some given threshold, and (ii) its average true positive rate (tp-rate) is not worse than the tp-rate of the best convex combination of the m given classifiers that satisfies the fp-rate constraint, in hindsight. We show that this problem is in fact a special case of the regret minimization problem with constraints, and therefore the above goal is not attainable. Hence, we pose a relaxed goal and propose a corresponding practical online learning meta-algorithm that attains it. In the case of two classifiers, we show that this algorithm takes a very simple form. To the best of our knowledge, this is the first algorithm that addresses the problem of average tp-rate maximization under average fp-rate constraints in the online setting.

JMLR Journal 2009 Journal Article

Online Learning with Sample Path Constraints

  • Shie Mannor
  • John N. Tsitsiklis
  • Jia Yuan Yu

We study online learning where a decision maker interacts with Nature with the objective of maximizing her long-term average reward subject to some sample path average constraints. We define the reward-in-hindsight as the highest reward the decision maker could have achieved, while satisfying the constraints, had she known Nature's choices in advance. We show that in general the reward-in-hindsight is not attainable. The convex hull of the reward-in-hindsight function is, however, attainable. For the important case of a single constraint, the convex hull turns out to be the highest attainable function. Using a calibrated forecasting rule, we provide an explicit strategy that attains this convex hull. We also measure the performance of heuristic methods based on non-calibrated forecasters in experiments involving a CPU power management problem.

JMLR Journal 2009 Journal Article

Robustness and Regularization of Support Vector Machines

  • Huan Xu
  • Constantine Caramanis
  • Shie Mannor

We consider regularized support vector machines (SVMs) and show that they are precisely equivalent to a new robust optimization formulation. We show that this equivalence of robust optimization and regularization has implications for both algorithms and analysis. In terms of algorithms, the equivalence suggests more general SVM-like algorithms for classification that explicitly build in protection to noise, and at the same time control overfitting. On the analysis front, the equivalence of robustness and regularization provides a robust optimization interpretation for the success of regularized SVMs. We use this new robustness interpretation of SVMs to give a new proof of consistency of (kernelized) SVMs, thus establishing robustness as the reason regularized SVMs generalize well.

EWRL Workshop 2008 Conference Paper

Efficient Reinforcement Learning in Parameterized Models: Discrete Parameter Case

  • Kirill Dyagilev
  • Shie Mannor
  • Nahum Shimkin

We consider reinforcement learning in the parameterized setup, where the model is known to belong to a finite set of Markov Decision Processes (MDPs) under the discounted return criterion. We propose an on-line algorithm for learning in such parameterized models, the Parameter Elimination (PEL) algorithm, and analyze its performance in terms of the total mistake bound criterion. The algorithm relies on Wald’s sequential probability ratio test to eliminate unlikely parameters, and uses an optimistic policy for effective exploration. We establish that, with high probability, the total mistake bound for the algorithm is linear (up to a logarithmic term) in the size of the parameter space, independently of the cardinality of the state and action spaces. We further demonstrate that a much better dependence on this size is possible, depending on the specific information structure of the problem.
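
The elimination step can be caricatured as a likelihood-ratio test against the current leader. A sketch with our own threshold choice; the paper uses Wald's SPRT with carefully set error probabilities.

```python
import numpy as np

def eliminate_parameters(loglik, alive, delta=0.05):
    """Drop candidates whose log-likelihood trails the leader's by more
    than log(1/delta); loglik[k] accumulates log p_k(observation)."""
    leader = max(alive, key=lambda k: loglik[k])
    return {k for k in alive
            if loglik[leader] - loglik[k] < np.log(1.0 / delta)}
```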

EWRL Workshop 2008 Conference Paper

Markov Decision Processes with Arbitrary Reward Processes

  • Jia Yuan Yu
  • Shie Mannor
  • Nahum Shimkin

We consider a control problem where the decision maker interacts with a standard Markov decision process with the exception that the reward functions vary arbitrarily over time. We extend the notion of Hannan consistency to this setting, showing that, in hindsight, the agent can perform almost as well as every deterministic policy. We present efficient online algorithms in the spirit of reinforcement learning that ensure that the agent’s performance loss, or regret, vanishes over time, provided that the environment is oblivious to the agent’s actions. However, counterexamples indicate that the regret does not vanish if the environment is not oblivious.

EWRL Workshop 2008 Conference Paper

Regularized Fitted Q-Iteration: Application to Planning

  • Amir Massoud Farahmand
  • Mohammad Ghavamzadeh
  • Csaba Szepesvári
  • Shie Mannor

We consider planning in a Markovian decision problem, i.e., the problem of finding a good policy given access to a generative model of the environment. We propose to use fitted Q-iteration with penalized (or regularized) least-squares regression as the regression subroutine to address the problem of controlling model complexity. The algorithm is presented in detail for the case when the function space is a reproducing-kernel Hilbert space underlying a user-chosen kernel function. We derive bounds on the quality of the solution and argue that data-dependent penalties can lead to almost optimal performance. A simple example is used to illustrate the benefits of using a penalized procedure.
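
A compact sketch of the loop, using kernel ridge regression as the penalized RKHS regressor; the kernel, penalty, and (state, action) featurization are our choices for illustration.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

def regularized_fqi(batch, actions, n_iters=20, gamma=0.95, alpha=1e-2):
    """batch: list of (s, a, r, s2) with scalar states and actions."""
    SA = np.array([[s, a] for s, a, r, s2 in batch])
    r = np.array([r for s, a, r, s2 in batch])
    s2 = np.array([s2 for s, a, r, s2 in batch])
    Q = None
    for _ in range(n_iters):
        if Q is None:
            y = r                                # first Bellman targets
        else:
            y = r + gamma * np.max(
                [Q.predict(np.column_stack([s2, np.full(len(s2), a)]))
                 for a in actions], axis=0)
        Q = KernelRidge(kernel="rbf", alpha=alpha).fit(SA, y)  # penalized fit
    return Q
```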

NeurIPS Conference 2008 Conference Paper

Regularized Policy Iteration

  • Amir Farahmand
  • Mohammad Ghavamzadeh
  • Shie Mannor
  • Csaba Szepesvári

In this paper we consider approximate policy-iteration-based reinforcement learning algorithms. In order to implement a flexible function approximation scheme we propose the use of non-parametric methods with regularization, providing a convenient way to control the complexity of the function approximator. We propose two novel regularized policy iteration algorithms by adding L2-regularization to two widely-used policy evaluation methods: Bellman residual minimization (BRM) and least-squares temporal difference learning (LSTD). We derive efficient implementations for our algorithms when the approximate value functions belong to a reproducing kernel Hilbert space. We also provide finite-sample performance bounds for our algorithms and show that they are able to achieve optimal rates of convergence under the studied conditions.

NeurIPS Conference 2008 Conference Paper

Robust Regression and Lasso

  • Huan Xu
  • Constantine Caramanis
  • Shie Mannor

We consider robust least-squares regression with feature-wise disturbance. We show that this formulation leads to tractable convex optimization problems, and we exhibit a particular uncertainty set for which the robust problem is equivalent to $\ell_1$ regularized regression (Lasso). This provides an interpretation of Lasso from a robust optimization perspective. We generalize this robust formulation to consider more general uncertainty sets, which all lead to tractable convex optimization problems. Therefore, we provide a new methodology for designing regression algorithms, which generalize known formulations. The advantage is that robustness to disturbance is a physical property that can be exploited: in addition to obtaining new formulations, we use it directly to show sparsity properties of Lasso, as well as to prove a general consistency result for robust regression problems, including Lasso, from a unified robustness perspective.
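
The key identity behind the Lasso interpretation, that for column-wise disturbances bounded by $c_j$ the worst-case residual equals $\|y - Xw\|_2 + \sum_j c_j |w_j|$, is easy to sanity-check numerically. A small self-contained check; the problem sizes and budgets are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
w = rng.normal(size=5)
y = rng.normal(size=30)
c = np.full(5, 0.3)                       # per-column disturbance budgets

closed_form = np.linalg.norm(y - X @ w) + c @ np.abs(w)
best = 0.0
for _ in range(2000):                     # random feasible disturbances
    D = rng.normal(size=X.shape)
    D *= c / np.linalg.norm(D, axis=0)    # column j scaled to norm c_j
    best = max(best, np.linalg.norm(y - (X + D) @ w))
assert best <= closed_form + 1e-9         # no sample beats the closed form
```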

JMLR Journal 2006 Journal Article

Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems

  • Eyal Even-Dar
  • Shie Mannor
  • Yishay Mansour

We incorporate statistical confidence intervals in both the multi-armed bandit and the reinforcement learning problems. In the bandit problem we show that given $n$ arms, it suffices to pull the arms a total of $O((n/\epsilon^2)\log(1/\delta))$ times to find an $\epsilon$-optimal arm with probability of at least $1-\delta$. This bound matches the lower bound of Mannor and Tsitsiklis (2004) up to constants. We also devise action elimination procedures in reinforcement learning algorithms. We describe a framework that is based on learning the confidence interval around the value function or the Q-function and eliminating actions that are not optimal (with high probability). We provide model-based and model-free variants of the elimination method. We further derive stopping conditions guaranteeing that the learned policy is approximately optimal with high probability. Simulations demonstrate a considerable speedup and added robustness over $\epsilon$-greedy Q-learning.
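
A textbook variant of the bandit elimination procedure, to make the mechanism concrete; the confidence radii and stopping rule are standard Hoeffding-style choices, not necessarily the paper's exact constants.

```python
import numpy as np

def successive_elimination(arms, eps=0.1, delta=0.05):
    """Sample surviving arms uniformly; drop an arm once its upper
    confidence bound falls below the best lower confidence bound."""
    n = len(arms)
    alive = list(range(n))
    sums = np.zeros(n)
    t = 0
    while len(alive) > 1:
        t += 1
        for i in alive:
            sums[i] += arms[i]()
        means = sums[alive] / t
        rad = np.sqrt(np.log(4 * n * t * t / delta) / (2 * t))
        if 2 * rad < eps:                     # survivors are eps-optimal
            break
        best_lcb = means.max() - rad
        alive = [i for i, m in zip(alive, means) if m + rad >= best_lcb]
    if len(alive) == 1:
        return alive[0]
    return alive[int(np.argmax(sums[alive] / t))]
```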

NeurIPS Conference 2006 Conference Paper

The Robustness-Performance Tradeoff in Markov Decision Processes

  • Huan Xu
  • Shie Mannor

Computation of a satisfactory control policy for a Markov decision process when the parameters of the model are not exactly known is a problem encountered in many practical applications. The traditional robust approach is based on a worst-case analysis and may lead to an overly conservative policy. In this paper we consider the tradeoff between nominal performance and the worst-case performance over all possible models. Based on parametric linear programming, we propose a method that computes the whole set of Pareto efficient policies in the performance-robustness plane when only the reward parameters are subject to uncertainty. In the more general case when the transition probabilities are also subject to error, we show that the strategy with the "optimal" tradeoff might be non-Markovian and hence is in general not tractable.

ICML Conference 2005 Conference Paper

The cross entropy method for classification

  • Shie Mannor
  • Dori Peleg
  • Reuven Y. Rubinstein

We consider support vector machines for binary classification. As opposed to most approaches, we use the number of support vectors (the "$L_0$ norm") as a regularizing term instead of the $L_1$ or $L_2$ norms. In order to solve the optimization problem we use the cross entropy method to search over the possible sets of support vectors. The algorithm consists of solving a sequence of efficient linear programs. We report experiments where our method produces generalization errors that are similar to support vector machines, while using a considerably smaller number of support vectors.
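
The cross entropy search itself is generic and short. A sketch over subsets encoded as 0/1 inclusion vectors; `score` stands in for the per-subset linear program from the paper, and the population and elite sizes are our placeholders.

```python
import numpy as np

def cross_entropy_subsets(score, n, iters=50, pop=200, n_elite=20, lr=0.7):
    """Maintain Bernoulli inclusion probabilities p over n candidates;
    sample subsets, score them, and move p toward the elite samples."""
    p = np.full(n, 0.5)
    for _ in range(iters):
        subsets = (np.random.rand(pop, n) < p).astype(float)
        scores = np.array([score(s) for s in subsets])
        elite = subsets[np.argsort(scores)[-n_elite:]]  # highest scores
        p = (1 - lr) * p + lr * elite.mean(axis=0)
    return p > 0.5
```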

JMLR Journal 2004 Journal Article

A Geometric Approach to Multi-Criterion Reinforcement Learning

  • Shie Mannor
  • Nahum Shimkin

We consider the problem of reinforcement learning in a controlled Markov environment with multiple objective functions of the long-term average reward type. The environment is initially unknown, and furthermore may be affected by the actions of other agents, actions that are observed but cannot be predicted beforehand. We capture this situation using a stochastic game model, where the learning agent is facing an adversary whose policy is arbitrary and unknown, and where the reward function is vector-valued. State recurrence conditions are imposed throughout. In our basic problem formulation, a desired target set is specified in the vector reward space, and the objective of the learning agent is to approach the target set, in the sense that the long-term average reward vector will belong to this set. We devise appropriate learning algorithms, that essentially use multiple reinforcement learning algorithms for the standard scalar reward problem, which are combined using the geometric insight from the theory of approachability for vector-valued stochastic games. We then address the more general and optimization-related problem, where a nested class of possible target sets is prescribed, and the goal of the learning agent is to approach the smallest possible target set (which will generally depend on the unknown system parameters). A particular case which falls into this framework is that of stochastic games with average reward constraints, and further specialization provides a reinforcement learning algorithm for constrained Markov decision processes. Some basic examples are provided to illustrate these results.

JMLR Journal 2004 Journal Article

The Sample Complexity of Exploration in the Multi-Armed Bandit Problem (Special Topic on Learning Theory)

  • Shie Mannor
  • John N. Tsitsiklis

We consider the multi-armed bandit problem under the PAC ("probably approximately correct") model. It was shown by Even-Dar et al. (2002) that given $n$ arms, a total of $O((n/\epsilon^2)\log(1/\delta))$ trials suffices in order to find an $\epsilon$-optimal arm with probability at least $1-\delta$. We establish a matching lower bound on the expected number of trials under any sampling policy. We furthermore generalize the lower bound, and show an explicit dependence on the (unknown) statistics of the arms. We also provide a similar bound within a Bayesian setting. The case where the statistics of the arms are known but the identities of the arms are not is also discussed. For this case, we provide a lower bound of $\Theta((1/\epsilon^2)(n+\log(1/\delta)))$ on the expected number of trials, as well as a sampling policy with a matching upper bound. If instead of the expected number of trials we consider the maximum (over all sample paths) number of trials, we establish matching upper and lower bounds of the form $\Theta((n/\epsilon^2)\log(1/\delta))$. Finally, we derive lower bounds on the expected regret, in the spirit of Lai and Robbins.
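
For orientation, the naive strategy already comes close to the upper bound discussed: pulling every arm $O((1/\epsilon^2)\log(n/\delta))$ times and taking the empirical best is $\epsilon$-optimal with probability $1-\delta$ by a Hoeffding union bound; the $O((n/\epsilon^2)\log(1/\delta))$ bound of Even-Dar et al. removes the extra $\log n$. A sketch of the naive baseline:

```python
import math

def naive_pac_best_arm(arms, eps=0.1, delta=0.05):
    """Uniform sampling baseline: O((n/eps^2) log(n/delta)) total pulls."""
    m = math.ceil((2 / eps**2) * math.log(2 * len(arms) / delta))
    means = [sum(arm() for _ in range(m)) / m for arm in arms]
    return max(range(len(arms)), key=means.__getitem__)
```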

JMLR Journal 2003 Journal Article

Greedy Algorithms for Classification -- Consistency, Convergence Rates, and Adaptivity

  • Shie Mannor
  • Ron Meir
  • Tong Zhang

Many regression and classification algorithms proposed over the years can be described as greedy procedures for the stagewise minimization of an appropriate cost function. Some examples include additive models, matching pursuit, and boosting. In this work we focus on the classification problem, for which many recent algorithms have been proposed and applied successfully. For a specific regularized form of greedy stagewise optimization, we prove consistency of the approach under rather general conditions. Focusing on specific classes of problems we provide conditions under which our greedy procedure achieves the (nearly) minimax rate of convergence, implying that the procedure cannot be improved in a worst case setting. We also construct a fully adaptive procedure, which, without knowing the smoothness parameter of the decision boundary, converges at the same rate as if the smoothness parameter were known.

NeurIPS Conference 2001 Conference Paper

The Steering Approach for Multi-Criteria Reinforcement Learning

  • Shie Mannor
  • Nahum Shimkin

We consider the problem of learning to attain multiple goals in a dynamic environment, which is initially unknown. In addition, the environment may contain arbitrarily varying elements related to actions of other agents or to non-stationary moves of Nature. This problem is modelled as a stochastic (Markov) game between the learning agent and an arbitrary player, with a vector-valued reward function. The objective of the learning agent is to have its long-term average reward vector belong to a given target set. We devise an algorithm for achieving this task, which is based on the theory of approachability for stochastic games. This algorithm combines, in an appropriate way, a finite set of standard, scalar-reward learning algorithms. Sufficient conditions are given for the convergence of the learning algorithm to a general target set. The specialization of these results to the single-controller Markov decision problem is discussed as well.

NeurIPS Conference 2000 Conference Paper

Weak Learners and Improved Rates of Convergence in Boosting

  • Shie Mannor
  • Ron Meir

The problem of constructing weak classifiers for boosting algorithms is studied. We present an algorithm that produces a linear classifier that is guaranteed to achieve an error better than random guessing for any distribution on the data. While this weak learner is not useful for learning in general, we show that under reasonable conditions on the distribution it yields an effective weak learner for one-dimensional problems. Preliminary simulations suggest that similar behavior can be expected in higher dimensions, a result which is corroborated by some recent theoretical bounds. Additionally, we provide improved convergence rate bounds for the generalization error in situations where the empirical error can be made small, which is exactly the situation that occurs if weak learners with guaranteed performance that is better than random guessing can be established.