Arrow Research search

Author name cluster

Alon Cohen

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

22 papers
2 author rows

Possible papers

22

EWRL Workshop 2024 Workshop Paper

Efficient Rate Optimal Regret for Adversarial Contextual MDPs Using Online Function Approximation

  • Orin Levy
  • Alon Cohen
  • Asaf Cassel
  • Yishay Mansour

We present the OMG-CMDP! algorithm for regret minimization in adversarial Contextual MDPs. The algorithm operates under the minimal assumptions of realizable function class and access to online least squares and log loss regression oracles. Our algorithm is efficient (assuming efficient online regression oracles), simple and robust to approximation errors. It enjoys an $\widetilde{O}(H^2 \sqrt{ TH|S||A| ( \mathcal{R}_{TH}(\mathcal{O}) + H log(1/\delta)} )$ regret guarantee, with $T$ being the number of episodes, $S$ the state space, $A$ the action space, $H$ the horizon. In addition, $\mathcal{R}_{TH}( \mathcal{O} )$ is the sum of the square and log-loss regression oracles' regret, used to approximate the context-dependent rewards and dynamics, respectively. To the best of our knowledge, our algorithm is the first efficient rate optimal regret minimization algorithm for adversarial CMDPs that operates under the minimal standard assumption of online function approximation.

ICML Conference 2024 Conference Paper

Eluder-based Regret for Stochastic Contextual MDPs

  • Orin Levy
  • Asaf B. Cassel
  • Alon Cohen
  • Yishay Mansour

We present the E-UC$^3$RL algorithm for regret minimization in Stochastic Contextual Markov Decision Processes (CMDPs). The algorithm operates under the minimal assumptions of realizable function class and access to offline least squares and log loss regression oracles. Our algorithm is efficient (assuming efficient offline regression oracles) and enjoys a regret guarantee of $ \widetilde{O}(H^3 \sqrt{T |S| |A|d_{\mathrm{E}}(\mathcal{P}) \log (|\mathcal{F}| |\mathcal{P}|/ \delta) )}) $, with $T$ being the number of episodes, $S$ the state space, $A$ the action space, $H$ the horizon, $\mathcal{P}$ and $\mathcal{F}$ are finite function classes used to approximate the context-dependent dynamics and rewards, respectively, and $d_{\mathrm{E}}(\mathcal{P})$ is the Eluder dimension of $\mathcal{P}$ w. r. t the Hellinger distance. To the best of our knowledge, our algorithm is the first efficient and rate-optimal regret minimization algorithm for CMDPs that operates under the general offline function approximation setting. In addition, we extend the Eluder dimension to general bounded metrics which may be of independent interest.

NeurIPS Conference 2024 Conference Paper

Fast Rates for Bandit PAC Multiclass Classification

  • Liad Erez
  • Alon Cohen
  • Tomer Koren
  • Yishay Mansour
  • Shay Moran

We study multiclass PAC learning with bandit feedback, where inputs are classified into one of $K$ possible labels and feedback is limited to whether or not the predicted labels are correct. Our main contribution is in designing a novel learning algorithm for the agnostic $(\varepsilon, \delta)$-PAC version of the problem, with sample complexity of $O\big( (\operatorname{poly}(K) + 1 / \varepsilon^2) \log (|\mathcal{H}| / \delta) \big)$ for any finite hypothesis class $\mathcal{H}$. In terms of the leading dependence on $\varepsilon$, this improves upon existing bounds for the problem, that are of the form $O(K/\varepsilon^2)$. We also provide an extension of this result to general classes and establish similar sample complexity bounds in which $\log |\mathcal{H}|$ is replaced by the Natarajan dimension. This matches the optimal rate in the full-information version of the problem and resolves an open question studied by Daniely, Sabato, Ben-David, and Shalev-Shwartz (2011) who demonstrated that the multiplicative price of bandit feedback in realizable PAC learning is $\Theta(K)$. We complement this by revealing a stark contrast with the agnostic case, where the price of bandit feedback is only $O(1)$ as $\varepsilon \to 0$. Our algorithm utilizes a stochastic optimization technique to minimize a log-barrier potential based on Frank-Wolfe updates for computing a low-variance exploration distribution over the hypotheses, and is made computationally efficient provided access to an ERM oracle over $\mathcal{H}$.

ICML Conference 2024 Conference Paper

Rate-Optimal Policy Optimization for Linear Markov Decision Processes

  • Uri Sherman
  • Alon Cohen
  • Tomer Koren
  • Yishay Mansour

We study regret minimization in online episodic linear Markov Decision Processes, and propose a policy optimization algorithm that is computationally efficient, and obtains rate optimal $\widetilde O (\sqrt K)$ regret where $K$ denotes the number of episodes. Our work is the first to establish the optimal rate (in terms of $K$) of convergence in the stochastic setting with bandit feedback using a policy optimization based approach, and the first to establish the optimal rate in the adversarial setup with full information feedback, for which no algorithm with an optimal rate guarantee was previously known.

EWRL Workshop 2024 Workshop Paper

Rate-Optimal Policy Optimization for Linear Markov Decision Processes

  • Uri Sherman
  • Alon Cohen
  • Tomer Koren
  • Yishay Mansour

We study regret minimization in online episodic linear Markov Decision Processes, and propose a policy optimization algorithm that is computationally efficient, and obtains rate optimal $\widetilde O (\sqrt K)$ regret where $K$ denotes the number of episodes. Our work is the first to establish the optimal rate (in terms of~$K$) of convergence in the stochastic setting with bandit feedback using a policy optimization based approach, and the first to establish the optimal rate in the adversarial setup with full information feedback, for which no algorithm with an optimal rate guarantee was previously known.

EWRL Workshop 2023 Workshop Paper

APART: Diverse Skill Discovery using All Pairs with Ascending Reward and DropouT

  • Hadar Schreiber
  • Tom Zahavy
  • Guillaume Desjardins
  • Alon Cohen

We study diverse skill discovery in reward-free environments, aiming to discover all possible skills in simple grid-world environments where prior methods have struggled to succeed. This problem is formulated as mutual training of skills using an intrinsic reward and a discriminator trained to predict a skill given its trajectory. Our initial solution replaces the standard one-vs-all (softmax) discriminator with a one-vs-one (all pairs) discriminator and combines it with a novel intrinsic reward function and a dropout regularization technique. The combined approach is named APART: Diverse Skill Discovery using All Pairs with Ascending Reward and Dropout. We demonstrate that APART discovers all the possible skills in grid worlds with remarkably fewer samples than previous works. Motivated by the empirical success of APART, we further investigate an even simpler algorithm that achieves maximum skills by altering VIC, rescaling its intrinsic reward, and tuning the temperature of its softmax discriminator. We believe our findings shed light on the crucial factors underlying success of skill discovery algorithms in reinforcement learning.

ICML Conference 2023 Conference Paper

Efficient Rate Optimal Regret for Adversarial Contextual MDPs Using Online Function Approximation

  • Orin Levy
  • Alon Cohen
  • Asaf B. Cassel
  • Yishay Mansour

We present the OMG-CMDP! algorithm for regret minimization in adversarial Contextual MDPs. The algorithm operates under the minimal assumptions of realizable function class and access to online least squares and log loss regression oracles. Our algorithm is efficient (assuming efficient online regression oracles), simple and robust to approximation errors. It enjoys an $\widetilde{O}(H^{2. 5} \sqrt{ T|S||A| ( \mathcal{R}_{TH}(\mathcal{O}) + H \log(\delta^{-1}) )})$ regret guarantee, with $T$ being the number of episodes, $S$ the state space, $A$ the action space, $H$ the horizon and $\mathcal{R}_{TH}(\mathcal{O}) = \mathcal{R}_{TH}(\mathcal{O}_{sq}^\mathcal{F}) + \mathcal{R}_{TH}(\mathcal{O}_{log}^\mathcal{P})$ is the sum of the square and log-loss regression oracles’ regret, used to approximate the context-dependent rewards and dynamics, respectively. To the best of our knowledge, our algorithm is the first efficient rate optimal regret minimization algorithm for adversarial CMDPs that operates under the minimal standard assumption of online function approximation.

EWRL Workshop 2022 Workshop Paper

Rate-Optimal Online Convex Optimization in Adaptive Linear Control

  • Asaf B Cassel
  • Alon Cohen
  • Tomer Koren

We consider the problem of controlling an unknown linear dynamical system under adversarially changing convex costs and full feedback of both the state and cost function. We present the first computationally-efficient algorithm that attains an optimal √ T-regret rate compared to the best stabilizing linear controller in hindsight, while avoiding stringent assumptions on the costs such as strong convexity. Our approach is based on a careful design of non-convex lower confidence bounds for the online costs, and uses a novel technique for computationally-efficient regret minimization of these bounds that leverages their particular non-convex structure.

NeurIPS Conference 2021 Conference Paper

Asynchronous Stochastic Optimization Robust to Arbitrary Delays

  • Alon Cohen
  • Amit Daniely
  • Yoel Drori
  • Tomer Koren
  • Mariano Schain

We consider the problem of stochastic optimization with delayed gradients in which, at each time step $t$, the algorithm makes an update using a stale stochastic gradient from step $t - d_t$ for some arbitrary delay $d_t$. This setting abstracts asynchronous distributed optimization where a central server receives gradient updates computed by worker machines. These machines can experience computation and communication loads that might vary significantly over time. In the general non-convex smooth optimization setting, we give a simple and efficient algorithm that requires $O( \sigma^2/\epsilon^4 + \tau/\epsilon^2 )$ steps for finding an $\epsilon$-stationary point $x$. Here, $\tau$ is the \emph{average} delay $\frac{1}{T}\sum_{t=1}^T d_t$ and $\sigma^2$ is the variance of the stochastic gradients. This improves over previous work, which showed that stochastic gradient decent achieves the same rate but with respect to the \emph{maximal} delay $\max_{t} d_t$, that can be significantly larger than the average delay especially in heterogeneous distributed systems. Our experiments demonstrate the efficacy and robustness of our algorithm in cases where the delay distribution is skewed or heavy-tailed.

NeurIPS Conference 2021 Conference Paper

Minimax Regret for Stochastic Shortest Path

  • Alon Cohen
  • Yonathan Efroni
  • Yishay Mansour
  • Aviv Rosenberg

We study the Stochastic Shortest Path (SSP) problem in which an agent has to reach a goal state in minimum total expected cost. In the learning formulation of the problem, the agent has no prior knowledge about the costs and dynamics of the model. She repeatedly interacts with the model for $K$ episodes, and has to minimize her regret. In this work we show that the minimax regret for this setting is $\widetilde O(\sqrt{ (B_\star^2 + B_\star) |S| |A| K})$ where $B_\star$ is a bound on the expected cost of the optimal policy from any state, $S$ is the state space, and $A$ is the action space. This matches the $\Omega (\sqrt{ B_\star^2 |S| |A| K})$ lower bound of Rosenberg et al. [2020] for $B_\star \ge 1$, and improves their regret bound by a factor of $\sqrt{|S|}$. For $B_\star < 1$ we prove a matching lower bound of $\Omega (\sqrt{ B_\star |S| |A| K})$. Our algorithm is based on a novel reduction from SSP to finite-horizon MDPs. To that end, we provide an algorithm for the finite-horizon setting whose leading term in the regret depends polynomially on the expected cost of the optimal policy and only logarithmically on the horizon.

AAAI Conference 2020 Conference Paper

Apprenticeship Learning via Frank-Wolfe

  • Tom Zahavy
  • Alon Cohen
  • Haim Kaplan
  • Yishay Mansour

We consider the applications of the Frank-Wolfe (FW) algorithm for Apprenticeship Learning (AL). In this setting, we are given a Markov Decision Process (MDP) without an explicit reward function. Instead, we observe an expert that acts according to some policy, and the goal is to find a policy whose feature expectations are closest to those of the expert policy. We formulate this problem as finding the projection of the feature expectations of the expert on the feature expectations polytope – the convex hull of the feature expectations of all the deterministic policies in the MDP. We show that this formulation is equivalent to the AL objective and that solving this problem using the FW algorithm is equivalent wellknown Projection method of Abbeel and Ng (2004). This insight allows us to analyze AL with tools from convex optimization literature and derive tighter convergence bounds on AL. Specifically, we show that a variation of the FW method that is based on taking “away steps” achieves a linear rate of convergence when applied to AL and that a stochastic version of the FW algorithm can be used to avoid precise estimation of feature expectations. We also experimentally show that this version outperforms the FW baseline. To the best of our knowledge, this is the first work that shows linear convergence rates for AL.

ICML Conference 2020 Conference Paper

Logarithmic Regret for Learning Linear Quadratic Regulators Efficiently

  • Asaf B. Cassel
  • Alon Cohen
  • Tomer Koren

We consider the problem of learning in Linear Quadratic Control systems whose transition parameters are initially unknown. Recent results in this setting have demonstrated efficient learning algorithms with regret growing with the square root of the number of decision steps. We present new efficient algorithms that achieve, perhaps surprisingly, regret that scales only (poly-)logarithmically with the number of steps, in two scenarios: when only the state transition matrix A is unknown, and when only the state-action transition matrix B is unknown and the optimal policy satisfies a certain non-degeneracy condition. On the other hand, we give a lower bound which shows that when the latter condition is violated, square root regret is unavoidable.

ICML Conference 2020 Conference Paper

Near-optimal Regret Bounds for Stochastic Shortest Path

  • Aviv Rosenberg 0002
  • Alon Cohen
  • Yishay Mansour
  • Haim Kaplan

Stochastic shortest path (SSP) is a well-known problem in planning and control, in which an agent has to reach a goal state in minimum total expected cost. In the learning formulation of the problem, the agent is unaware of the environment dynamics (i. e. , the transition function) and has to repeatedly play for a given number of episodes, while learning the problem’s optimal solution. Unlike other well-studied models in reinforcement learning (RL), the length of an episode is not predetermined (or bounded) and is influenced by the agent’s actions. Recently, \cite{tarbouriech2019noregret} studied this problem in the context of regret minimization, and provided an algorithm whose regret bound is inversely proportional to the square root of the minimum instantaneous cost. In this work we remove this dependence on the minimum cost—we give an algorithm that guarantees a regret bound of $\widetilde{O}(B^{3/2} S \sqrt{A K})$, where $B$ is an upper bound on the expected cost of the optimal policy, $S$ is the number of states, $A$ is the number of actions and $K$ is the total number of episodes. We additionally show that any learning algorithm must have at least $\Omega(B \sqrt{S A K})$ regret in the worst case.

UAI Conference 2020 Conference Paper

Unknown mixing times in apprenticeship and reinforcement learning

  • Tom Zahavy
  • Alon Cohen
  • Haim Kaplan
  • Yishay Mansour

We derive and analyze learning algorithms for apprenticeship learning, policy evaluation and policy gradient for average reward criteria. Existing algorithms explicitly require an upper bound on the mixing time. In contrast, we build on ideas from Markov chain theory and derive sampling algorithms that do not require such an upper bound. For these algorithms, we provide theoretical bounds on their sample-complexity and running time.

ICML Conference 2019 Conference Paper

Learning Linear-Quadratic Regulators Efficiently with only √T Regret

  • Alon Cohen
  • Tomer Koren
  • Yishay Mansour

We present the first computationally-efficient algorithm with $\widetilde{O}(\sqrt{T})$ regret for learning in Linear Quadratic Control systems with unknown dynamics. By that, we resolve an open question of Abbasi-Yadkori and Szepesvari (2011) and Dean, Mania, Matni, Recht, and Tu (2018).

NeurIPS Conference 2019 Conference Paper

Learning to Screen

  • Alon Cohen
  • Avinatan Hassidim
  • Haim Kaplan
  • Yishay Mansour
  • Shay Moran

Imagine a large firm with multiple departments that plans a large recruitment. Candidates arrive one-by-one, and for each candidate the firm decides, based on her data (CV, skills, experience, etc), whether to summon her for an interview. The firm wants to recruit the best candidates while minimizing the number of interviews. We model such scenarios as an assignment problem between items (candidates) and categories (departments): the items arrive one-by-one in an online manner, and upon processing each item the algorithm decides, based on its value and the categories it can be matched with, whether to retain or discard it (this decision is irrevocable). The goal is to retain as few items as possible while guaranteeing that the set of retained items contains an optimal matching. We consider two variants of this problem: (i) in the first variant it is assumed that the $n$ items are drawn independently from an unknown distribution $D$. (ii) In the second variant it is assumed that before the process starts, the algorithm has an access to a training set of $n$ items drawn independently from the same unknown distribution (e. g. \ data of candidates from previous recruitment seasons). We give tight bounds on the minimum possible number of retained items in each of these variants. These results demonstrate that one can retain exponentially less items in the second variant (with the training set). Our algorithms and analysis utilize ideas and techniques from statistical learning theory and from discrete algorithms.

ICML Conference 2018 Conference Paper

Online Linear Quadratic Control

  • Alon Cohen
  • Avinatan Hassidim
  • Tomer Koren
  • Nevena Lazic
  • Yishay Mansour
  • Kunal Talwar

We study the problem of controlling linear time-invariant systems with known noisy dynamics and adversarially chosen quadratic losses. We present the first efficient online learning algorithms in this setting that guarantee $O(\sqrt{T})$ regret under mild assumptions, where $T$ is the time horizon. Our algorithms rely on a novel SDP relaxation for the steady-state distribution of the system. Crucially, and in contrast to previously proposed relaxations, the feasible solutions of our SDP all correspond to “strongly stable” policies that mix exponentially fast to a steady state.

IJCAI Conference 2018 Conference Paper

Planning and Learning with Stochastic Action Sets

  • Craig Boutilier
  • Alon Cohen
  • Avinatan Hassidim
  • Yishay Mansour
  • Ofer Meshi
  • Martin Mladenov
  • Dale Schuurmans

In many practical uses of reinforcement learning (RL) the set of actions available at a given state is a random variable, with realizations governed by an exogenous stochastic process. Somewhat surprisingly, the foundations for such sequential decision processes have been unaddressed. In this work, we formalize and investigate MDPs with stochastic action sets (SAS-MDPs) to provide these foundations. We show that optimal policies and value functions in this model have a structure that admits a compact representation. From an RL perspective, we show that Q-learning with sampled action sets is sound. In model-based settings, we consider two important special cases: when individual actions are available with independent probabilities, and a sampling-based model for unknown distributions. We develop polynomial-time value and policy iteration methods for both cases, and provide a polynomial-time linear programming solution for the first case.

ICML Conference 2016 Conference Paper

Online Learning with Feedback Graphs Without the Graphs

  • Alon Cohen
  • Tamir Hazan
  • Tomer Koren

We study an online learning framework introduced by Mannor and Shamir (2011) in which the feedback is specified by a graph, in a setting where the graph may vary from round to round and is \emphnever fully revealed to the learner. We show a large gap between the adversarial and the stochastic cases. In the adversarial case, we prove that even for dense feedback graphs, the learner cannot improve upon a trivial regret bound obtained by ignoring any additional feedback besides her own loss. In contrast, in the stochastic case we give an algorithm that achieves \widetildeΘ(\sqrtαT) regret over T rounds, provided that the independence numbers of the hidden feedback graphs are at most α. We also extend our results to a more general feedback model, in which the learner does not necessarily observe her own loss, and show that, even in simple cases, concealing the feedback graphs might render the problem unlearnable.

ICML Conference 2015 Conference Paper

Following the Perturbed Leader for Online Structured Learning

  • Alon Cohen
  • Tamir Hazan

We investigate a new Follow the Perturbed Leader (FTPL) algorithm for online structured prediction problems. We show a regret bound which is comparable to the state of the art of FTPL algorithms and is comparable with the best possible regret in some cases. To better understand FTPL algorithms for online structured learning, we present a lower bound on the regret for a large and natural class of FTPL algorithms that use logconcave perturbations. We complete our investigation with an online shortest path experiment and empirically show that our algorithm is both statistically and computationally efficient.

ICAPS Conference 2013 Conference Paper

Faster Optimal Planning with Partial-Order Pruning

  • David Leo Wright Hall
  • Alon Cohen
  • David Burkett
  • Dan Klein 0001

When planning problems have many kinds of resources or high concurrency, each optimal state has exponentially many minor variants, some of which are "better" than others. Standard methods like \Astar cannot effectively exploit these minor relative differences, and therefore must explore many redundant, clearly suboptimal plans. We describe a new optimal search algorithm for planning that leverages a partial order relation between states. Under suitable conditions, states that are dominated by other states with respect to this order can be pruned while provably maintaining optimality. We also describe a simple method for automatically discovering compatible partial orders in both serial and concurrent domains. In our experiments we find that more than 98% of search states can be pruned in some domains.