Author name cluster

Orin Levy

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

10 papers

2 author rows

AAAI Conference 2025 Conference Paper

Batch Ensemble for Variance Dependent Regret in Stochastic Bandits

Asaf Cassel
Orin Levy
Yishay Mansour

Efficiently trading off exploration and exploitation is one of the key challenges in online Reinforcement Learning (RL). Most works achieve this by carefully estimating the model uncertainty and following the so-called optimistic model. Inspired by practical ensemble methods, in this work we propose a simple and novel batch ensemble scheme that provably achieves near-optimal regret for stochastic Multi-Armed Bandits (MAB). Crucially, our algorithm has just a single parameter, namely the number of batches, and its value does not depend on distributional properties such as the scale and variance of the losses. We complement our theoretical results by demonstrating the effectiveness of our algorithm on synthetic benchmarks.


NeurIPS Conference 2025 Conference Paper

Regret Bounds for Adversarial Contextual Bandits with General Function Approximation and Delayed Feedback

Orin Levy
Liad Erez
Alon Peled-Cohen
Yishay Mansour

We present regret minimization algorithms for the contextual multi-armed bandit (CMAB) problem over $K$ actions in the presence of delayed feedback, a scenario where loss observations arrive with delays chosen by an adversary. As a preliminary result, assuming direct access to a finite policy class $\Pi$ we establish an optimal expected regret bound of $O(\sqrt{KT \log |\Pi|} + \sqrt{D \log |\Pi|})$ where $D$ is the sum of delays. For our main contribution, we study the general function approximation setting over a (possibly infinite) contextual loss function class $\mathcal{F}$ with access to an online least-square regression oracle $\mathcal{O}$ over $\mathcal{F}$. In this setting, we achieve an expected regret bound of $O(\sqrt{KTR_T(\mathcal{O})} + \sqrt{d_{\max} D \beta})$ assuming FIFO order, where $d_{\max}$ is the maximal delay, $R_T(\mathcal{O})$ is an upper bound on the oracle's regret and $\beta$ is a stability parameter associated with the oracle. We complement this general result by presenting a novel stability analysis of a Hedge-based version of Vovk's aggregating forecaster as an oracle implementation for least-square regression over a finite function class $\mathcal{F}$ and show that its stability parameter $\beta$ is bounded by $\log |\mathcal{F}|$, resulting in an expected regret bound of $O(\sqrt{KT \log |\mathcal{F}|} + \sqrt{d_{\max} D \log |\mathcal{F}|})$ which is a $\sqrt{d_{\max}}$ factor away from the lower bound of $\Omega(\sqrt{KT \log |\mathcal{F}|} + \sqrt{D \log |\mathcal{F}|})$ that we also present.
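
To make the stated gap concrete, the following minimal sketch evaluates the upper- and lower-bound expressions quoted in the abstract with all constants dropped. The function names and the numeric instance (a fixed delay `d` in every round, so that $D = dT$ and $d_{\max} = d$) are illustrative assumptions, not taken from the paper.

```python
import math

def upper_bound(K, T, D, d_max, log_F):
    """Upper-bound expression from the abstract, constants dropped."""
    return math.sqrt(K * T * log_F) + math.sqrt(d_max * D * log_F)

def lower_bound(K, T, D, log_F):
    """Lower-bound expression from the abstract, constants dropped."""
    return math.sqrt(K * T * log_F) + math.sqrt(D * log_F)

# Hypothetical instance: every observation is delayed by exactly d rounds.
K, T, d = 10, 10_000, 16
log_F = math.log(1000)   # log of an assumed finite class size |F| = 1000
D = d * T                # sum of delays under the fixed-delay assumption

ratio = upper_bound(K, T, D, d, log_F) / lower_bound(K, T, D, log_F)
print(f"upper/lower = {ratio:.3f}, sqrt(d_max) = {math.sqrt(d):.3f}")
```

Because only the delay term carries the extra $\sqrt{d_{\max}}$ factor, the ratio of the two expressions always lies between 1 and $\sqrt{d_{\max}}$.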


EWRL Workshop 2024 Workshop Paper

Efficient Rate Optimal Regret for Adversarial Contextual MDPs Using Online Function Approximation

Orin Levy
Alon Cohen
Asaf Cassel
Yishay Mansour

We present the OMG-CMDP! algorithm for regret minimization in adversarial Contextual MDPs. The algorithm operates under the minimal assumptions of realizable function class and access to online least squares and log loss regression oracles. Our algorithm is efficient (assuming efficient online regression oracles), simple and robust to approximation errors. It enjoys an $\widetilde{O}(H^2 \sqrt{TH|S||A| ( \mathcal{R}_{TH}(\mathcal{O}) + H \log(1/\delta) )})$ regret guarantee, with $T$ being the number of episodes, $S$ the state space, $A$ the action space, and $H$ the horizon. In addition, $\mathcal{R}_{TH}(\mathcal{O})$ is the sum of the square and log-loss regression oracles' regret, used to approximate the context-dependent rewards and dynamics, respectively. To the best of our knowledge, our algorithm is the first efficient rate optimal regret minimization algorithm for adversarial CMDPs that operates under the minimal standard assumption of online function approximation.


ICML Conference 2024 Conference Paper

Eluder-based Regret for Stochastic Contextual MDPs

Orin Levy
Asaf B. Cassel
Alon Cohen
Yishay Mansour

We present the E-UC$^3$RL algorithm for regret minimization in Stochastic Contextual Markov Decision Processes (CMDPs). The algorithm operates under the minimal assumptions of realizable function class and access to offline least squares and log loss regression oracles. Our algorithm is efficient (assuming efficient offline regression oracles) and enjoys a regret guarantee of $\widetilde{O}(H^3 \sqrt{T |S| |A| d_{\mathrm{E}}(\mathcal{P}) \log (|\mathcal{F}| |\mathcal{P}| / \delta)})$, with $T$ being the number of episodes, $S$ the state space, $A$ the action space, $H$ the horizon, $\mathcal{P}$ and $\mathcal{F}$ finite function classes used to approximate the context-dependent dynamics and rewards, respectively, and $d_{\mathrm{E}}(\mathcal{P})$ the Eluder dimension of $\mathcal{P}$ w.r.t. the Hellinger distance. To the best of our knowledge, our algorithm is the first efficient and rate-optimal regret minimization algorithm for CMDPs that operates under the general offline function approximation setting. In addition, we extend the Eluder dimension to general bounded metrics, which may be of independent interest.


NeurIPS Conference 2024 Conference Paper

Online Weighted Paging with Unknown Weights

Orin Levy
Noam Touitou
Aviv Rosenberg

Online paging is a fundamental problem in the field of online algorithms, in which one maintains a cache of $k$ slots as requests for fetching pages arrive online. In the weighted variant of this problem, each page has its own fetching cost; a substantial line of work on this problem culminated in an (optimal) $O(\log k)$-competitive randomized algorithm, due to Bansal, Buchbinder and Naor (FOCS'07). Existing work for weighted paging assumes that page weights are known in advance, which is not always the case in practice. For example, in multi-level caching architectures, the expected cost of fetching a memory block is a function of its probability of being in a mid-level cache rather than the main memory. This complex property cannot be predicted in advance; over time, however, one may glean information about page weights through sampling their fetching cost multiple times. We present the first algorithm for online weighted paging that does not know page weights in advance, but rather learns from weight samples. In terms of techniques, this requires providing (integral) samples to a fractional solver, requiring a delicate interface between this solver and the randomized rounding scheme; we believe that our work can inspire online algorithms to other problems that involve cost sampling.
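
The weighted paging setting described above can be sketched as a small cost simulator. This is only an illustration of the problem (a cache of `k` slots, a per-page fetch cost paid on each miss); it uses plain LRU eviction with weights known in advance, which is neither the algorithm of Bansal, Buchbinder and Naor nor the sample-based algorithm of the paper. The function name `paging_cost` and the toy instance are hypothetical.

```python
from collections import OrderedDict

def paging_cost(requests, weights, k):
    """Total fetch cost of LRU eviction on a weighted paging instance.

    Assumption: weights are known and deterministic; LRU ignores them
    when choosing what to evict, so it serves only as a baseline.
    """
    cache = OrderedDict()              # keys ordered from least to most recently used
    total = 0.0
    for page in requests:
        if page in cache:
            cache.move_to_end(page)    # hit: refresh recency, no cost
            continue
        total += weights[page]         # miss: pay this page's fetch cost
        if len(cache) >= k:
            cache.popitem(last=False)  # evict the least recently used page
        cache[page] = True
    return total

# Toy instance: page "c" is ten times more expensive to fetch.
weights = {"a": 1.0, "b": 1.0, "c": 10.0}
requests = ["a", "b", "c", "a", "b", "c"]
print(paging_cost(requests, weights, k=2))
print(paging_cost(requests, weights, k=3))
```

With `k=2` every request in this sequence misses under LRU, including both fetches of the expensive page, which is the kind of behavior a weight-aware policy tries to avoid.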


EWRL Workshop 2024 Workshop Paper

Regret Guarantees for Adversarial Contextual Bandits with Delayed Feedback

Liad Erez
Orin Levy
Yishay Mansour

In this paper we present regret minimization algorithms for the contextual multi-armed bandit (CMAB) problem in the presence of delayed feedback, a scenario where reward observations arrive with delays chosen by an adversary. We study two fundamental frameworks in terms of the function classes used to derive regret bounds for CMAB. Firstly, for a finite policy class $ \Pi $, we establish an optimal regret bound of $ O \left( \sqrt{KT \log |\Pi|} + \sqrt{D \log |\Pi|} \right) $, where $ K $ is the number of arms, $ T $ is the number of rounds, and $ D $ is the sum of delays. Secondly, assuming a finite contextual reward function class $ \mathcal{F} $ and access to an online least-square regression oracle $\mathcal{O}$ over $\mathcal{F}$, we achieve a regret bound of $\widetilde{O}(\sqrt{KT\cdot (\mathcal{R}_T(\mathcal{O})+\log (\delta^{-1}))} + \eta D + d_m)$ that holds with probability at least $1-\delta$, where $d_m$ is the maximal delay, $\mathcal{R}_T(\mathcal{O})$ is an upper bound on the oracle's regret and $\eta$ is a stability parameter associated with the oracle.


ICML Conference 2023 Conference Paper

Efficient Rate Optimal Regret for Adversarial Contextual MDPs Using Online Function Approximation

Orin Levy
Alon Cohen
Asaf B. Cassel
Yishay Mansour

We present the OMG-CMDP! algorithm for regret minimization in adversarial Contextual MDPs. The algorithm operates under the minimal assumptions of realizable function class and access to online least squares and log loss regression oracles. Our algorithm is efficient (assuming efficient online regression oracles), simple and robust to approximation errors. It enjoys an $\widetilde{O}(H^{2.5} \sqrt{T|S||A| ( \mathcal{R}_{TH}(\mathcal{O}) + H \log(\delta^{-1}) )})$ regret guarantee, with $T$ being the number of episodes, $S$ the state space, $A$ the action space, and $H$ the horizon. Here $\mathcal{R}_{TH}(\mathcal{O}) = \mathcal{R}_{TH}(\mathcal{O}_{sq}^\mathcal{F}) + \mathcal{R}_{TH}(\mathcal{O}_{log}^\mathcal{P})$ is the sum of the square and log-loss regression oracles’ regret, used to approximate the context-dependent rewards and dynamics, respectively. To the best of our knowledge, our algorithm is the first efficient rate optimal regret minimization algorithm for adversarial CMDPs that operates under the minimal standard assumption of online function approximation.


AAAI Conference 2023 Conference Paper

Optimism in Face of a Context: Regret Guarantees for Stochastic Contextual MDP

Orin Levy
Yishay Mansour

We present regret minimization algorithms for stochastic contextual MDPs under a minimum reachability assumption, using access to an offline least square regression oracle. We analyze three different settings: where the dynamics is known, where the dynamics is unknown but independent of the context, and the most challenging setting where the dynamics is unknown and context-dependent. For the latter, our algorithm obtains a regret bound (up to poly-logarithmic factors) of order $(H + 1/p_{\min}) H |S|^{3/2} \sqrt{|A| T \log(\max\{|\mathcal{G}|, |\mathcal{P}|\} / \delta)}$ with probability $1 - \delta$, where $\mathcal{P}$ and $\mathcal{G}$ are finite and realizable function classes used to approximate the dynamics and rewards respectively, $p_{\min}$ is the minimum reachability parameter, $S$ is the set of states, $A$ the set of actions, $H$ the horizon, and $T$ the number of episodes. To our knowledge, our approach is the first optimistic approach applied to contextual MDPs with general function approximation (i.e., without additional knowledge regarding the function class, such as it being linear). We present a lower bound of $\Omega(\sqrt{TH|S||A| \ln|\mathcal{G}| / \ln|A|})$ on the expected regret, which holds even in the case of known dynamics. Lastly, we discuss an extension of our results to CMDPs without minimum reachability, which obtains regret of order $T^{3/4}$.


EWRL Workshop 2022 Workshop Paper

Learning Efficiently Function Approximation for Contextual MDP

Orin Levy
Yishay Mansour

We study learning contextual MDPs using function approximation for both the rewards and the dynamics. We consider both the case where the dynamics depend on the context and the case where they are independent of it. For both models we derive polynomial sample and time complexity (assuming an efficient ERM oracle). Our methodology gives a general reduction from learning contextual MDPs to supervised learning.


EWRL Workshop 2022 Workshop Paper

Optimism in Face of a Context: Regret Guarantees for Stochastic Contextual MDP

Orin Levy
Yishay Mansour

We present regret minimization algorithms for stochastic contextual MDPs under a minimum reachability assumption, using access to an offline least square regression oracle. We analyze three different settings: where the dynamics is known, where the dynamics is unknown but independent of the context, and the most challenging setting where the dynamics is unknown and context-dependent. For the latter, our algorithm obtains an $\widetilde{O}\left(\max\{H, 1/p_{\min}\} H |S|^{3/2} \sqrt{|A| T \log(\max\{|\mathcal{G}|, |\mathcal{P}|\} / \delta)}\right)$ regret bound, with probability $1 - \delta$, where $\mathcal{P}$ and $\mathcal{G}$ are finite and realizable function classes used to approximate the dynamics and rewards respectively, $p_{\min}$ is the minimum reachability parameter, $S$ is the set of states, $A$ the set of actions, $H$ the horizon, and $T$ the number of episodes. To our knowledge, our approach is the first optimistic approach applied to contextual MDPs with general function approximation (i.e., without additional knowledge regarding the function class, such as it being linear). In addition, we present a lower bound of $\Omega(\sqrt{TH|S||A| \ln(|\mathcal{G}|) / \ln(|A|)})$ on the expected regret, which holds even in the case of known dynamics.
