Arrow Research search

Author name cluster

Navdeep Kumar

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

13 papers
2 author rows

Possible papers (13)

ICLR Conference 2025 Conference Paper

Global Convergence of Policy Gradient in Average Reward MDPs

  • Navdeep Kumar
  • Yashaswini Murthy
  • Itai Shufaro
  • Kfir Yehuda Levy
  • R. Srikant 0001
  • Shie Mannor

We present the first comprehensive finite-time global convergence analysis of policy gradient for infinite horizon average reward Markov decision processes (MDPs). Specifically, we focus on ergodic tabular MDPs with finite state and action spaces. Our analysis shows that the policy gradient iterates converge to the optimal policy at a sublinear rate of $O(\frac{1}{T})$, where $T$ represents the number of iterations. Performance bounds for discounted reward MDPs cannot be easily extended to average reward MDPs, as those bounds grow in proportion to the fifth power of the effective horizon. Recent work on such extensions makes a smoothness assumption that has not been verified. Thus, our primary contribution is in providing the first complete proof that the policy gradient algorithm converges globally for average-reward MDPs, without such an assumption. We also obtain the corresponding finite-time performance guarantees. In contrast to the existing discounted reward performance bounds, our performance bounds have an explicit dependence on constants that capture the complexity of the underlying MDP. Motivated by this observation, we reexamine and improve the existing performance bounds for discounted reward MDPs. We also present simulations that empirically validate the results.
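
As a concrete illustration of the setting, here is a minimal numpy sketch of tabular policy gradient for an ergodic average-reward MDP: the softmax policy's gradient is computed exactly from the stationary distribution and the differential (bias) values obtained from the Poisson equation. The toy MDP, the softmax parameterization, and the step size are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

# Illustrative ergodic tabular MDP (random but fixed); not the paper's setup.
rng = np.random.default_rng(0)
S, A = 4, 3
P = rng.dirichlet(np.ones(S), size=(S, A))   # transition kernel P[s, a, s']
R = rng.uniform(size=(S, A))                 # reward r(s, a)

def policy(theta):
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)  # softmax policy pi[s, a]

def avg_reward_and_grad(theta):
    pi = policy(theta)
    P_pi = np.einsum('sa,sax->sx', pi, P)    # Markov chain under pi
    r_pi = (pi * R).sum(axis=1)
    # Stationary distribution d: left eigenvector of P_pi for eigenvalue 1.
    w, vecs = np.linalg.eig(P_pi.T)
    d = np.real(vecs[:, np.argmin(np.abs(w - 1))])
    d = d / d.sum()
    rho = d @ r_pi                           # average reward of pi
    # Differential (bias) values: solve the Poisson equation.
    h = np.linalg.solve(np.eye(S) - P_pi + np.outer(np.ones(S), d), r_pi - rho)
    Q = R - rho + P @ h                      # differential action values
    V = (pi * Q).sum(axis=1, keepdims=True)
    return rho, d[:, None] * pi * (Q - V)    # exact gradient wrt softmax logits

theta = np.zeros((S, A))
for t in range(2000):
    rho, grad = avg_reward_and_grad(theta)
    theta += 1.0 * grad                      # plain policy gradient ascent
print("average reward after training:", rho)
```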

NeurIPS Conference 2025 Conference Paper

Non-rectangular Robust MDPs with Normed Uncertainty Sets

  • Navdeep Kumar
  • Adarsh Gupta
  • Maxence Mohamed Elfatihi
  • Giorgia Ramponi
  • Kfir Y. Levy
  • Shie Mannor

Robust policy evaluation for non-rectangular uncertainty sets is generally NP-hard, even in approximation. Consequently, existing approaches suffer from either exponential iteration complexity or significant accuracy gaps. Interestingly, we identify a powerful class of $L_p$-bounded uncertainty sets that avoid these complexity barriers due to their structural simplicity. We further show that this class can be decomposed into infinitely many sa-rectangular $L_p$-bounded sets and leverage its structural properties to derive a novel dual formulation for $L_p$ robust Markov Decision Processes (MDPs). This formulation reveals key insights into the adversary’s strategy and leads to the first polynomial-time robust policy evaluation algorithm for $L_1$-normed non-rectangular robust MDPs.
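
For intuition on what "non-rectangular" means here: the adversary faces a single norm budget on the entire kernel, rather than an independent budget per state-action pair. The brute-force Monte Carlo sketch below only illustrates that coupled evaluation problem on a toy chain; the paper's polynomial-time dual formulation is not reproduced here, and the MDP and the L1 budget are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
S, gamma, beta = 4, 0.9, 0.1
P0 = rng.dirichlet(np.ones(S), size=S)       # nominal chain under a fixed policy
r = rng.uniform(size=S)                      # reward under that policy

def value(P):
    return np.linalg.solve(np.eye(S) - gamma * P, r)

worst = value(P0)[0]
for _ in range(20000):
    # One zero-row-sum perturbation of the *whole* kernel, rescaled so that a
    # single L1 budget couples all states: the non-rectangular constraint.
    D = rng.normal(size=(S, S))
    D -= D.mean(axis=1, keepdims=True)
    D *= beta / np.abs(D).sum()
    P = P0 + D
    if (P >= 0).all():                       # keep only valid kernels
        worst = min(worst, value(P)[0])
print("sampled worst-case value (an upper bound on the true robust value):", worst)
```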

EWRL Workshop 2025 Workshop Paper

Non-rectangular Robust MDPs with Normed Uncertainty Sets

  • Navdeep Kumar
  • Adarsh Gupta
  • Maxence Mohamed Elfatihi
  • Giorgia Ramponi
  • Kfir Yehuda Levy
  • Shie Mannor

Robust policy evaluation for non-rectangular uncertainty sets is generally NP-hard, even in approximation. Consequently, existing approaches suffer from either exponential iteration complexity or significant accuracy gaps. Interestingly, we identify a powerful class of Lp-bounded uncertainty sets that avoid these complexity barriers due to their structural simplicity. We further show that this class can be decomposed into infinitely many sa-rectangular Lp-bounded sets and leverage its structural properties to derive a novel dual formulation for Lp robust Markov Decision Processes (MDPs). This formulation provides key insights into the adversary’s strategy and enables the development of an efficient robust policy evaluation algorithm for these Lp-normed non-rectangular robust MDPs.

NeurIPS Conference 2025 Conference Paper

On the Convergence of Single-Timescale Actor-Critic

  • Navdeep Kumar
  • Priyank Agrawal
  • Giorgia Ramponi
  • Kfir Y. Levy
  • Shie Mannor

We analyze the global convergence of the single-timescale actor-critic (AC) algorithm for infinite-horizon discounted Markov Decision Processes (MDPs) with finite state spaces. To this end, we introduce an elegant analytical framework for handling the complex, coupled recursions inherent in the algorithm. Leveraging this framework, we establish that the algorithm converges to an $\epsilon$-close globally optimal policy with a sample complexity of $O(\epsilon^{-3})$. This significantly improves upon the existing complexity of $O(\epsilon^{-2})$ for reaching an $\epsilon$-close stationary policy, which is equivalent to a complexity of $O(\epsilon^{-4})$ for reaching an $\epsilon$-close globally optimal policy via the gradient domination lemma. Furthermore, we demonstrate that to achieve this improvement, the step sizes for both the actor and critic must decay as $O(k^{-\frac{2}{3}})$ with iteration $k$, diverging from the conventional $O(k^{-\frac{1}{2}})$ rates commonly used in (non)convex optimization.
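
A schematic single-timescale actor-critic loop matching the step-size schedule highlighted in the abstract: tabular critic, softmax actor, and both step sizes decaying as $k^{-2/3}$. The toy MDP and the constants are illustrative, and this is a generic AC instance rather than the paper's exact algorithm or analysis.

```python
import numpy as np

# Toy discounted MDP; constants are illustrative assumptions.
rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.95
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a, s']
R = rng.uniform(size=(S, A))

theta = np.zeros((S, A))                     # actor: softmax logits
V = np.zeros(S)                              # critic: tabular value estimate
s = 0
for k in range(1, 100_001):
    alpha_k = 0.5 * k ** (-2 / 3)            # actor step size  ~ k^{-2/3}
    beta_k = 1.0 * k ** (-2 / 3)             # critic step size ~ k^{-2/3}
    logits = theta[s] - theta[s].max()
    pi_s = np.exp(logits) / np.exp(logits).sum()
    a = rng.choice(A, p=pi_s)
    s_next = rng.choice(S, p=P[s, a])
    delta = R[s, a] + gamma * V[s_next] - V[s]   # TD error drives both updates
    V[s] += beta_k * delta                       # critic: TD(0) step
    grad_log = -pi_s
    grad_log[a] += 1.0                           # grad of log pi(a|s) wrt theta[s]
    theta[s] += alpha_k * delta * grad_log       # actor: policy gradient step
    s = s_next
```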

EWRL Workshop 2025 Workshop Paper

On the Convergence of Single-Timescale Actor-Critic

  • Navdeep Kumar
  • Priyank Agrawal
  • Giorgia Ramponi
  • Kfir Yehuda Levy
  • Shie Mannor

We analyze the global convergence of the single-timescale actor-critic (AC) algorithm for infinite-horizon discounted Markov Decision Processes (MDPs) with finite state spaces. To this end, we introduce an elegant analytical framework for handling the complex, coupled recursions inherent in the algorithm. Leveraging this framework, we establish that the algorithm converges to an $\epsilon$-close globally optimal policy with a sample complexity of $O(\epsilon^{-3})$. This significantly improves upon the existing complexity of $O(\epsilon^{-2})$ for reaching an $\epsilon$-close stationary policy, which is equivalent to a complexity of $O(\epsilon^{-4})$ for reaching an $\epsilon$-close globally optimal policy via the gradient domination lemma. Furthermore, we demonstrate that to achieve this improvement, the step sizes for both the actor and critic must decay as $O(k^{-\frac{2}{3}})$ with iteration $k$, diverging from the conventional $O(k^{-\frac{1}{2}})$ rates commonly used in (non)convex optimization.

ICML Conference 2024 Conference Paper

Bring Your Own (Non-Robust) Algorithm to Solve Robust MDPs by Estimating The Worst Kernel

  • Uri Gadot
  • Kaixin Wang
  • Navdeep Kumar
  • Kfir Yehuda Levy
  • Shie Mannor

Robust Markov Decision Processes (RMDPs) provide a framework for sequential decision-making that is robust to perturbations on the transition kernel. However, current RMDP methods are often limited to small-scale problems, hindering their use in high-dimensional domains. To bridge this gap, we present EWoK, a novel online approach to solving RMDPs that Estimates the Worst transition Kernel to learn robust policies. Unlike previous works that regularize the policy or value updates, EWoK achieves robustness by simulating the worst scenarios for the agent while retaining complete flexibility in the learning process. Notably, EWoK can be applied on top of any off-the-shelf non-robust RL algorithm, enabling easy scaling to high-dimensional domains. Our experiments, spanning from simple Cartpole to high-dimensional DeepMind Control Suite environments, demonstrate the effectiveness and applicability of the EWoK paradigm as a practical method for learning robust policies.
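
A heavily hedged tabular sketch of the "simulate the worst kernel, then run any non-robust method" idea: tilt the nominal kernel toward low-value next states and hand the result to a standard planner. The exponential tilting rule, the robustness level eta, and the toy MDP are illustrative assumptions, not EWoK's actual worst-kernel estimator.

```python
import numpy as np

def adversarial_kernel(P0, V, eta):
    """Tilt each (s, a) row of the nominal kernel toward low-value successors.

    P0  : nominal transitions, shape (S, A, S)
    V   : current value estimate, shape (S,)
    eta : robustness level (eta = 0 recovers the nominal kernel)
    """
    W = P0 * np.exp(-eta * V)                 # down-weight high-value next states
    return W / W.sum(axis=-1, keepdims=True)  # renormalize each row

# Any standard, non-robust method can then be run on the tilted kernel,
# e.g. plain value iteration on a toy MDP:
rng = np.random.default_rng(0)
S, A, gamma = 4, 2, 0.9
P0 = rng.dirichlet(np.ones(S), size=(S, A))
R = rng.uniform(size=(S, A))
V = np.zeros(S)
for _ in range(200):
    P_adv = adversarial_kernel(P0, V, eta=1.0)
    V = (R + gamma * P_adv @ V).max(axis=1)
```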

ICML Conference 2024 Conference Paper

Efficient Value Iteration for s-rectangular Robust Markov Decision Processes

  • Navdeep Kumar
  • Kaixin Wang
  • Kfir Yehuda Levy
  • Shie Mannor

We focus on s-rectangular robust Markov decision processes (MDPs), which capture interconnected uncertainties across different actions within each state. This framework is more general than sa-rectangular robust MDPs, where uncertainties in each action are independent. However, the introduced interdependence significantly amplifies the complexity of the problem. Existing methods either come with slow performance guarantees or are inapplicable to even moderately large state spaces. In this work, we derive optimal robust Bellman operators in explicit forms. This leads to robust value iteration methods with significantly faster time complexities than existing approaches, which can be used in large state spaces. Further, our findings reveal that the optimal policies demonstrate a novel threshold behavior, selectively favoring a limited set of actions based on their respective advantage functions. Additionally, our study uncovers a noteworthy connection between the robustness of a policy and the variance in its value function, highlighting that policies with lower variance exhibit greater resilience.

AAAI Conference 2024 Conference Paper

Solving Non-rectangular Reward-Robust MDPs via Frequency Regularization

  • Uri Gadot
  • Esther Derman
  • Navdeep Kumar
  • Maxence Mohamed Elfatihi
  • Kfir Levy
  • Shie Mannor

In robust Markov decision processes (RMDPs), it is assumed that the reward and the transition dynamics lie in a given uncertainty set. By targeting maximal return under the most adversarial model from that set, RMDPs address performance sensitivity to misspecified environments. Yet, to preserve computational tractability, the uncertainty set is traditionally structured independently for each state. This so-called rectangularity condition is solely motivated by computational concerns. As a result, it lacks a practical incentive and may lead to overly conservative behavior. In this work, we study coupled-reward RMDPs where the transition kernel is fixed, but the reward function lies within an alpha-radius of a nominal one. We draw a direct connection between this type of non-rectangular reward-RMDP and policy visitation frequency regularization. We introduce a policy-gradient method and prove its convergence. Numerical experiments illustrate the learned policy's robustness and its less conservative behavior compared to rectangular uncertainty sets.
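
The regularization connection can be made concrete with a one-line duality argument. Assuming the reward uncertainty is an $\ell_p$ ball of radius $\alpha$ around a nominal reward $r_0$ (the generic norm-ball form, not necessarily the paper's exact uncertainty set) and writing $d_\pi$ for the policy's discounted state-action visitation frequencies, Hölder's inequality gives

```latex
\min_{\|\delta\|_p \le \alpha} \; \langle d_\pi,\, r_0 + \delta \rangle
  \;=\; \langle d_\pi, r_0 \rangle \;-\; \alpha \,\| d_\pi \|_q,
  \qquad \tfrac{1}{p} + \tfrac{1}{q} = 1 ,
```

so maximizing the robust return over policies is the same as maximizing the nominal return with the visitation-frequency regularizer $-\alpha\|d_\pi\|_q$, which is the frequency-regularization viewpoint referred to above.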

NeurIPS Conference 2023 Conference Paper

Policy Gradient for Rectangular Robust Markov Decision Processes

  • Navdeep Kumar
  • Esther Derman
  • Matthieu Geist
  • Kfir Y. Levy
  • Shie Mannor

Policy gradient methods have become a standard for training reinforcement learning agents in a scalable and efficient manner. However, they do not account for transition uncertainty, whereas learning robust policies can be computationally expensive. In this paper, we introduce robust policy gradient (RPG), a policy-based method that efficiently solves rectangular robust Markov decision processes (MDPs). We provide a closed-form expression for the worst occupation measure. Incidentally, we find that the worst kernel is a rank-one perturbation of the nominal kernel. Combining the worst occupation measure with a robust Q-value estimation yields an explicit form of the robust gradient. Our resulting RPG can be estimated from data with the same time complexity as its non-robust equivalent. Hence, it relieves the computational burden of the convex optimization problems required for training robust policies by current policy gradient approaches.

EWRL Workshop 2023 Workshop Paper

Robust Reinforcement Learning via Adversarial Kernel Approximation

  • Kaixin Wang
  • Uri Gadot
  • Navdeep Kumar
  • Kfir Yehuda Levy
  • Shie Mannor

Robust Markov Decision Processes (RMDPs) provide a framework for sequential decision-making that is robust to perturbations on the transition kernel. However, robust reinforcement learning (RL) approaches in RMDPs do not scale well to realistic online settings with high-dimensional domains. By characterizing the adversarial kernel in RMDPs, we propose a novel approach for online robust RL that approximates the adversarial kernel and uses a standard (non-robust) RL algorithm to learn a robust policy. Notably, our approach can be applied on top of any underlying RL algorithm, enabling easy scaling to high-dimensional domains. Experiments in classic control tasks, MinAtar and DeepMind Control Suite demonstrate the effectiveness and the applicability of our method.

EWRL Workshop 2023 Workshop Paper

Towards Faster Global Convergence of Robust Policy Gradient Methods

  • Navdeep Kumar
  • Ilnura Usmanova
  • Kfir Yehuda Levy
  • Shie Mannor

Recently, global convergence has been achieved for non-robust MDPs with an iteration complexity of $O(\frac{1}{\epsilon})$ for finding an $\epsilon$-optimal policy, with a PL condition derived from the performance difference lemma playing a key role. This work extends the performance difference lemma to s-rectangular robust MDPs, from which a corresponding PL condition can be derived. We further present a simplified proof of policy gradient convergence for the non-robust case which, together with the robust performance difference lemma, leads to global convergence of robust policy gradient.
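
For context, the non-robust performance difference lemma referred to above, in its standard discounted form (with $d^{\pi'}$ the normalized discounted state-visitation distribution of $\pi'$ and $A^{\pi}$ the advantage function of $\pi$):

```latex
J(\pi') - J(\pi)
  \;=\; \frac{1}{1-\gamma}\,
        \mathbb{E}_{s \sim d^{\pi'}}\!\left[ \sum_{a} \pi'(a \mid s)\, A^{\pi}(s, a) \right].
```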

EWRL Workshop 2022 Workshop Paper

$Q$-Learning for $L_p$ Robust Markov Decision Processes

  • Navdeep Kumar
  • Kaixin Wang
  • Kfir Levy
  • Shie Mannor

Robust Markov Decision Processes (MDPs) are a powerful tool for sequential decision-making when system parameters are partially known, changing, or adversarial. Recently, there have been works aimed at solving sa- and s-rectangular robust MDPs. These methods are model-based but can potentially be generalized to model-free settings. We formally propose model-free algorithms for sa- and s-rectangular Lp robust MDPs and provide their convergence guarantees. The proposed model-free algorithms can be combined with existing deep RL techniques such as DQN to solve challenging problems.

ICML Conference 2022 Conference Paper

The Geometry of Robust Value Functions

  • Kaixin Wang
  • Navdeep Kumar
  • Kuangqi Zhou
  • Bryan Hooi
  • Jiashi Feng
  • Shie Mannor

The space of value functions is a fundamental concept in reinforcement learning. Characterizing its geometric properties may provide insights for optimization and representation. Existing works mainly focus on the value space for Markov Decision Processes (MDPs). In this paper, we study the geometry of the robust value space for the more general Robust MDP (RMDP) setting, where transition uncertainties are considered. Specifically, since we find it hard to directly adapt prior approaches to RMDPs, we start by revisiting the non-robust case and introduce a new perspective that enables us to characterize both the non-robust and robust value space in a similar fashion. The key to this perspective is to decompose the value space, in a state-wise manner, into unions of hypersurfaces. Through our analysis, we show that the robust value space is determined by a set of conic hypersurfaces, each of which contains the robust values of all policies that agree on one state. Furthermore, we find that taking only extreme points in the uncertainty set is sufficient to determine the robust value space. Finally, we discuss some other aspects of the robust value space, including its non-convexity and policy agreement on multiple states.
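
A tiny numpy sketch of the non-robust object the abstract starts from: the set of value vectors $V^\pi = (I - \gamma P_\pi)^{-1} r_\pi$ traced out by sweeping over the policies of a two-state MDP. The MDP and the policy grid are illustrative assumptions; the robust analogue studied in the paper replaces the single kernel with an uncertainty set.

```python
import numpy as np

# A two-state, two-action MDP; policies are parameterized by the probability
# of taking action 0 in each state, so a grid over [0,1]^2 covers the whole
# stochastic policy space. The numbers are illustrative.
gamma = 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],     # P[s, a, s']
              [[0.3, 0.7], [0.6, 0.4]]])
R = np.array([[1.0, 0.0],
              [0.0, 1.0]])                  # r(s, a)

points = []
for p0 in np.linspace(0, 1, 60):
    for p1 in np.linspace(0, 1, 60):
        pi = np.array([[p0, 1 - p0], [p1, 1 - p1]])
        P_pi = np.einsum('sa,sax->sx', pi, P)
        r_pi = (pi * R).sum(axis=1)
        V = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
        points.append(V)
points = np.array(points)   # rows (V(s0), V(s1)); scatter-plotting them traces
                            # out the non-robust value-function space.
```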