Author name cluster

Gugan Thoppe

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

7 papers

2 author rows

AAAI Conference 2026 Conference Paper

Parameter-free Optimal Rates for Nonlinear Semi-Norm Contractions with Applications to Q-Learning

Ankur Naskar
Gugan Thoppe
Vijay Gupta

Algorithms for solving nonlinear fixed-point equations, such as average-reward Q-learning and TD-learning, often involve semi-norm contractions. Achieving parameter-free optimal convergence rates for these methods via Polyak–Ruppert averaging has remained elusive, largely due to the non-monotonicity of such semi-norms. We close this gap by (i) recasting the averaged error as a linear recursion involving a nonlinear perturbation, and (ii) taming the nonlinearity by coupling the semi-norm's contraction with the monotonicity of a suitably induced norm. Our main result yields the first parameter-free $\tilde{O}(1/\sqrt{t})$ optimal rates for Q-learning in both average-reward and exponentially discounted settings, where $t$ denotes the iteration index. The result applies within a broad framework that accommodates both synchronous and asynchronous updates, single-agent and distributed deployments, and data streams obtained from either simulators or along Markovian trajectories.
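As a rough illustration of the Polyak–Ruppert averaging mentioned in this abstract, the sketch below averages the iterates of a toy scalar stochastic fixed-point iteration. It is not the paper's algorithm or setting: the map `0.5 * x + 1`, the Gaussian noise, the stepsize `t ** -0.75`, and both function names are assumptions chosen only for illustration.

```python
import random


def running_average(iterates):
    """Polyak-Ruppert averages: the k-th output is the mean of the first k iterates."""
    averages = []
    avg = 0.0
    for t, x in enumerate(iterates, start=1):
        # Incremental form of avg_t = (1/t) * sum of the first t iterates.
        avg += (x - avg) / t
        averages.append(avg)
    return averages


def noisy_fixed_point_iterates(num_steps, seed=0):
    """Toy iteration for x = 0.5 * x + 1 (fixed point 2) with noisy evaluations."""
    rng = random.Random(seed)
    x = 0.0
    iterates = []
    for t in range(1, num_steps + 1):
        step = t ** -0.75  # hypothetical stepsize; uses no problem-dependent constants
        noisy_target = 0.5 * x + 1.0 + rng.gauss(0.0, 1.0)
        x += step * (noisy_target - x)
        iterates.append(x)
    return iterates


if __name__ == "__main__":
    xs = noisy_fixed_point_iterates(10_000)
    print("last iterate:", xs[-1])
    print("averaged iterate:", running_average(xs)[-1])
```

The averaging step itself needs no knowledge of the contraction factor or noise level, which is the sense in which such a scheme can be called parameter-free.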

ICML Conference 2025 Conference Paper

Policy Gradient with Tree Expansion

Gal Dalal
Assaf Hallak
Gugan Thoppe
Shie Mannor
Gal Chechik

Policy gradient methods are notorious for having a large variance and high sample complexity. To mitigate this, we introduce SoftTreeMax—a generalization of softmax that employs planning. In SoftTreeMax, we extend the traditional logits with the multi-step discounted cumulative reward, topped with the logits of future states. We analyze SoftTreeMax and explain how tree expansion helps to reduce its gradient variance. We prove that the variance depends on the chosen tree-expansion policy. Specifically, we show that the closer the induced transitions are to being state-independent, the stronger the variance decay. With approximate forward models, we prove that the resulting gradient bias diminishes with the approximation error while retaining the same variance reduction. Ours is the first result to bound the gradient bias for an approximate model. In a practical implementation of SoftTreeMax we utilize a parallel GPU-based simulator for fast and efficient tree expansion. Using this implementation in Atari, we show that SoftTreeMax reduces the gradient variance by three orders of magnitude. This leads to better sample complexity and improved performance compared to distributed PPO.

TMLR Journal 2024 Journal Article

Global Convergence Guarantees for Federated Policy Gradient Methods with Adversaries

Swetha Ganesh
Jiayu Chen
Gugan Thoppe
Vaneet Aggarwal

Federated Reinforcement Learning (FRL) allows multiple agents to collaboratively build a decision-making policy without sharing raw trajectories. However, if even a small fraction of these agents are adversarial, the results can be catastrophic. We propose a policy-gradient-based approach that is robust to adversarial agents, which can send arbitrary values to the server. Under this setting, our results form the first global convergence guarantees with general parametrization. These results demonstrate resilience to adversaries, while achieving optimal sample complexity of order $\tilde{\mathcal{O}}\left( \frac{1}{N\epsilon^2} \left( 1+ \frac{f^2}{N}\right)\right)$, where $N$ is the total number of agents and $f < N/2$ is the number of adversarial agents.

ICML Conference 2024 Conference Paper

Risk Estimation in a Markov Cost Process: Lower and Upper Bounds

Gugan Thoppe
Prashanth L. A.
Sanjay P. Bhat

We tackle the problem of estimating risk measures of the infinite-horizon discounted cost of a Markov cost process. The risk measures we study include variance, Value-at-Risk (VaR), and Conditional Value-at-Risk (CVaR). First, we show that estimating any of these risk measures with $\epsilon$-accuracy, either in the expected or high-probability sense, requires at least $\Omega(1/\epsilon^2)$ samples. Then, using a truncation scheme, we derive an upper bound for the CVaR and variance estimation. This bound matches our lower bound up to logarithmic factors. Finally, we discuss an extension of our estimation scheme that covers more general risk measures satisfying a certain continuity criterion, such as spectral risk measures and utility-based shortfall risk. To the best of our knowledge, our work is the first to provide lower and upper bounds for estimating any risk measure beyond the mean within a Markovian setting. Our lower bounds also extend to the mean of the infinite-horizon discounted cost. Even in that case, our lower bound of $\Omega(1/\epsilon^2)$ improves upon the existing $\Omega(1/\epsilon)$ bound (Metelli et al., 2023).
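The idea of truncating an infinite-horizon discounted cost and then estimating a tail risk measure from samples can be sketched as follows. This is a generic illustration, not the estimator analyzed in the paper: the function names, the bounded-cost assumption (`cost_bound` on the absolute per-step cost), and the convention that higher cost is worse are all assumptions made here. With per-step costs bounded by `cost_bound`, dropping all terms from step `H` onward changes the discounted sum by at most `gamma**H * cost_bound / (1 - gamma)`.

```python
import math


def truncation_horizon(gamma, cost_bound, eps):
    """Smallest H with gamma**H * cost_bound / (1 - gamma) <= eps (needs 0 < gamma < 1, eps > 0)."""
    horizon = 0
    while gamma ** horizon * cost_bound / (1.0 - gamma) > eps:
        horizon += 1
    return horizon


def truncated_discounted_cost(costs, gamma):
    """Discounted sum of the costs observed along one truncated trajectory."""
    return sum(gamma ** k * c for k, c in enumerate(costs))


def empirical_cvar(samples, level):
    """Mean of the worst (1 - level) fraction of samples, where larger cost is worse."""
    n = len(samples)
    # The small offset guards against floating-point overshoot in n * (1 - level).
    tail_count = max(1, math.ceil(n * (1.0 - level) - 1e-9))
    worst = sorted(samples, reverse=True)[:tail_count]
    return sum(worst) / tail_count
```

In use, one would simulate independent trajectories of length `truncation_horizon(...)`, map each to its truncated discounted cost, and pass the resulting list to `empirical_cvar`.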

UAI Conference 2023 Conference Paper

Does Momentum Help in Stochastic Optimization? A Sample Complexity Analysis

Swetha Ganesh
Rohan Deb
Gugan Thoppe
Amarjit Budhiraja

Stochastic Heavy Ball (SHB) and Nesterov’s Accelerated Stochastic Gradient (ASG) are popular momentum methods in optimization. While the benefits of these acceleration ideas in deterministic settings are well understood, their advantages in stochastic optimization are unclear. Several works have recently claimed that SHB and ASG always help in stochastic optimization. Our work shows that (i) these claims are either flawed or one-sided (e.g., they consider only the bias term but not the variance), and (ii) when both these terms are accounted for, SHB and ASG do not always help. Specifically, for any quadratic optimization, we obtain a lower bound on the sample complexity of SHB and ASG, accounting for both bias and variance, and show that the vanilla SGD can achieve the same bound.
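For readers unfamiliar with the methods being compared, the following minimal sketch shows the standard heavy-ball update and how it reduces to vanilla SGD when the momentum is zero. The one-dimensional quadratic `f(x) = 0.5 * x**2`, the additive Gaussian gradient noise, and the function names are assumptions for illustration; the sketch does not reproduce the paper's analysis or experiments.

```python
import random


def shb_step(x, x_prev, grad, lr, momentum):
    """One heavy-ball step: a gradient step plus a multiple of the previous displacement."""
    return x - lr * grad + momentum * (x - x_prev)


def run_on_quadratic(num_steps, lr, momentum, noise_std=1.0, seed=0):
    """Run SHB on f(x) = 0.5 * x**2 from x = 1 with noisy gradients; return the final x."""
    rng = random.Random(seed)
    x_prev = x = 1.0
    for _ in range(num_steps):
        noisy_grad = x + rng.gauss(0.0, noise_std)  # exact gradient is x
        x, x_prev = shb_step(x, x_prev, noisy_grad, lr, momentum), x
    return x


if __name__ == "__main__":
    # momentum = 0.0 is vanilla SGD; any other value gives SHB.
    print("SGD:", run_on_quadratic(1000, lr=0.01, momentum=0.0))
    print("SHB:", run_on_quadratic(1000, lr=0.01, momentum=0.9))
```

A single run like this says nothing about sample complexity; a fair comparison would tune both methods and average the squared error over many seeds.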

AAAI Conference 2020 Conference Paper

A Tale of Two-Timescale Reinforcement Learning with the Tightest Finite-Time Bound

Gal Dalal
Balazs Szorenyi
Gugan Thoppe

Policy evaluation in reinforcement learning is often conducted using two-timescale stochastic approximation, which results in various gradient temporal difference methods such as GTD(0), GTD2, and TDC. Here, we provide convergence rate bounds for this suite of algorithms. Algorithms such as these have two iterates, $\theta_n$ and $w_n$, which are updated using two distinct stepsize sequences, $\alpha_n$ and $\beta_n$, respectively. Assuming $\alpha_n = n^{-\alpha}$ and $\beta_n = n^{-\beta}$ with $1 > \alpha > \beta > 0$, we show that, with high probability, the two iterates converge to their respective solutions $\theta^*$ and $w^*$ at rates given by $\theta_n - \theta^* = \tilde{O}(n^{-\alpha/2})$ and $w_n - w^* = \tilde{O}(n^{-\beta/2})$; here, $\tilde{O}$ hides logarithmic terms. Via comparable lower bounds, we show that these bounds are, in fact, tight. To the best of our knowledge, ours is the first finite-time analysis which achieves these rates. While it was known that the two timescale components decouple asymptotically, our results depict this phenomenon more explicitly by showing that it in fact happens from some finite time onwards. Lastly, compared to existing works, our result applies to a broader family of stepsizes, including non-square summable ones.
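To make the two-iterate structure concrete, here is a minimal sketch of one TDC step with linear features, using the update as it is commonly stated in the gradient-TD literature and the polynomial stepsizes $\alpha_n = n^{-\alpha}$ and $\beta_n = n^{-\beta}$ from the abstract. The function names, argument layout, and default exponents (0.75 and 0.5) are assumptions for illustration, not values taken from the paper.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def tdc_step(theta, w, phi, phi_next, reward, gamma, n, alpha_exp=0.75, beta_exp=0.5):
    """One TDC update at iteration n >= 1; returns the new (theta, w)."""
    alpha_n = n ** -alpha_exp  # smaller stepsize: theta is the slow iterate
    beta_n = n ** -beta_exp    # larger stepsize: w is the fast iterate
    # Temporal-difference error under the current theta.
    delta = reward + gamma * dot(phi_next, theta) - dot(phi, theta)
    correction = dot(phi, w)
    new_theta = [
        t + alpha_n * (delta * f - gamma * correction * f_next)
        for t, f, f_next in zip(theta, phi, phi_next)
    ]
    new_w = [wi + beta_n * (delta - correction) * f for wi, f in zip(w, phi)]
    return new_theta, new_w
```

Because $\alpha > \beta$, the stepsize `alpha_n` decays faster than `beta_n`, so `w` adapts on the faster timescale while `theta` moves on the slower one.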

AAAI Conference 2018 Conference Paper

Finite Sample Analyses for TD(0) With Function Approximation

Gal Dalal
Balázs Szörényi
Gugan Thoppe
Shie Mannor

TD(0) is one of the most commonly used algorithms in reinforcement learning. Despite this, there is no existing finite sample analysis for TD(0) with function approximation, even for the linear case. Our work is the first to provide such results. Works that managed to obtain convergence rates for online Temporal Difference (TD) methods analyzed somewhat modified versions of them that include projections and stepsizes dependent on unknown problem parameters. Our analysis obviates these artificial alterations by exploiting strong properties of TD(0). We provide convergence rates both in expectation and with high probability. Both are based on relatively unknown, recently developed stochastic approximation techniques.
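For reference, unmodified TD(0) with linear function approximation (no projection step) can be sketched as below. The two-state deterministic cycle, the one-hot features, the stepsize `n ** -0.5`, and all names are assumptions chosen so that the example is small and easy to check by hand; they are not taken from the paper.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def td0_update(theta, phi, phi_next, reward, gamma, stepsize):
    """One plain TD(0) step for the linear value estimate V(s) = phi(s) . theta."""
    delta = reward + gamma * dot(phi_next, theta) - dot(phi, theta)
    return [t + stepsize * delta * f for t, f in zip(theta, phi)]


def evaluate_two_state_cycle(num_steps, gamma=0.5):
    """Toy chain 0 -> 1 -> 0 -> ... with reward 1 on leaving state 0 and 0 on leaving state 1."""
    features = [[1.0, 0.0], [0.0, 1.0]]  # one-hot features (tabular special case)
    rewards = [1.0, 0.0]
    theta = [0.0, 0.0]
    state = 0
    for n in range(1, num_steps + 1):
        next_state = 1 - state
        stepsize = n ** -0.5  # hypothetical schedule; uses no problem-dependent constants
        theta = td0_update(theta, features[state], features[next_state],
                           rewards[state], gamma, stepsize)
        state = next_state
    return theta
```

For this toy chain with `gamma = 0.5`, the Bellman equations V(0) = 1 + 0.5 V(1) and V(1) = 0.5 V(0) give V(0) = 4/3 and V(1) = 2/3, which is what the iterates should approach.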
