Arrow Research search

Author name cluster

Alec Koppel

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

36 papers
2 author rows

Possible papers

36

TMLR Journal 2025 Journal Article

Beyond Joint Demonstrations: Personalized Expert Guidance for Efficient Multi-Agent Reinforcement Learning

  • Peihong Yu
  • Manav Mishra
  • Alec Koppel
  • Carl Busart
  • Priya Narayan
  • Dinesh Manocha
  • Amrit Singh Bedi
  • Pratap Tokekar

Multi-Agent Reinforcement Learning (MARL) algorithms face the challenge of efficient exploration due to the exponential increase in the size of the joint state-action space. While demonstration-guided learning has proven beneficial in single-agent settings, its direct applicability to MARL is hindered by the practical difficulty of obtaining joint expert demonstrations. In this work, we introduce a novel concept of personalized expert demonstrations, tailored for each individual agent or, more broadly, each individual type of agent within a heterogeneous team. These demonstrations solely pertain to single-agent behaviors and how each agent can achieve personal goals without encompassing any cooperative elements; naively imitating them will therefore not achieve cooperation, due to potential conflicts. To this end, we propose an approach that selectively utilizes personalized expert demonstrations as guidance and allows agents to learn to cooperate, namely personalized expert-guided MARL (PegMARL). This algorithm utilizes two discriminators: the first provides incentives based on the alignment of individual agent behavior with demonstrations, and the second regulates incentives based on whether the behaviors lead to the desired outcome. We evaluate PegMARL using personalized demonstrations in both discrete and continuous environments. The results demonstrate that PegMARL learns near-optimal policies even when provided with suboptimal demonstrations and outperforms state-of-the-art MARL algorithms in solving coordinated tasks. We also showcase PegMARL’s capability of leveraging joint demonstrations in the StarCraft scenario and converging effectively even with demonstrations from non-co-trained policies.

RLC Conference 2025 Conference Paper

Building Sequential Resource Allocation Mechanisms without Payments

  • Sihan Zeng
  • Sujay Bhatt
  • Alec Koppel
  • Sumitra Ganesh

We study allocating divisible resources of limited quantities to agents who submit requests for the resources one or multiple times over a finite horizon. This is referred to as the sequential or online resource allocation problem, as irrevocable allocations need to be made as the requests arrive, without observing future requests. The existing work on sequential resource allocation (in the payment-free setting) mainly focuses on optimizing social welfare and designs mechanisms under the assumption that the agents make truthful requests. Such mechanisms are easily exploitable -- strategic agents may misreport their requests to inflate their allocations. Our aim in this work is to design sequential resource allocation mechanisms that balance the competing objectives of social welfare maximization (promoting overall agent satisfaction) and incentive compatibility (ensuring that agents do not have incentives to misreport). We do not design these mechanisms from scratch. Instead, as the incentive-compatible mechanism design problem has been well studied in the one-shot setting (horizon length equal to one), we propose a general meta-algorithm for transforming a one-shot mechanism into its sequential counterpart. The meta-algorithm can plug in any one-shot mechanism and approximately carry over the properties that the one-shot mechanism already satisfies to the sequential setting. We establish theoretical results validating these claims and illustrate their superior performance relative to baselines in experiments.

RLJ Journal 2025 Journal Article

Building Sequential Resource Allocation Mechanisms without Payments

  • Sihan Zeng
  • Sujay Bhatt
  • Alec Koppel
  • Sumitra Ganesh

We study allocating divisible resources of limited quantities to agents who submit requests for the resources one or multiple times over a finite horizon. This is referred to as the sequential or online resource allocation problem, as irrevocable allocations need to be made as the requests arrive, without observing future requests. The existing work on sequential resource allocation (in the payment-free setting) mainly focuses on optimizing social welfare and designs mechanisms under the assumption that the agents make truthful requests. Such mechanisms are easily exploitable -- strategic agents may misreport their requests to inflate their allocations. Our aim in this work is to design sequential resource allocation mechanisms that balance the competing objectives of social welfare maximization (promoting overall agent satisfaction) and incentive compatibility (ensuring that agents do not have incentives to misreport). We do not design these mechanisms from scratch. Instead, as the incentive-compatible mechanism design problem has been well studied in the one-shot setting (horizon length equal to one), we propose a general meta-algorithm for transforming a one-shot mechanism into its sequential counterpart. The meta-algorithm can plug in any one-shot mechanism and approximately carry over the properties that the one-shot mechanism already satisfies to the sequential setting. We establish theoretical results validating these claims and illustrate their superior performance relative to baselines in experiments.

ICLR Conference 2025 Conference Paper

Collab: Controlled Decoding using Mixture of Agents for LLM Alignment

  • Souradip Chakraborty
  • Sujay Bhatt
  • Udari Madhushani
  • Soumya Suvra Ghosal
  • Jiahao Qiu
  • Mengdi Wang 0001
  • Dinesh Manocha
  • Furong Huang

Alignment of Large Language Models (LLMs) is crucial for safe and trustworthy deployment in applications. Reinforcement learning from human feedback (RLHF) has emerged as an effective technique to align LLMs with human preferences and broader utilities, but it requires updating billions of model parameters, which is computationally expensive. Controlled Decoding, by contrast, provides a mechanism for aligning a model at inference time without retraining. However, single-agent decoding approaches often struggle to adapt to diverse tasks due to the complexity and variability inherent in these tasks. To strengthen test-time performance w.r.t. the target task, we propose a mixture-of-agents decoding strategy leveraging existing off-the-shelf aligned LLM policies. Treating each prior policy as an agent in the spirit of mixture-of-agents collaboration, we develop a decoding method that allows for inference-time alignment through a token-level selection strategy among multiple agents. For each token, the most suitable LLM is dynamically chosen from a pool of models based on a long-term utility metric. This policy-switching mechanism ensures optimal model selection at each step, enabling efficient collaboration and alignment among LLMs during decoding. Theoretical analysis of our proposed algorithm establishes optimal performance with respect to the target task represented via a target reward, for the given off-the-shelf models. We conduct comprehensive empirical evaluations with open-source aligned models on diverse tasks and preferences, which demonstrates the merits of this approach over single-agent decoding baselines. Notably, COLLAB surpasses the current SoTA decoding strategy, achieving an improvement of up to 1.56x in average reward and 71.89% in GPT-4-based win-tie rate.
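
As a rough illustration of the token-level switching rule described above, the sketch below scores each candidate agent's greedy next token by log-likelihood plus a stand-in utility, and lets the highest-scoring agent emit the token. Everything here (the toy agents, `toy_utility`, the tiny vocabulary) is hypothetical scaffolding, not the paper's actual models or long-term utility metric.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 8

def make_agent():
    """Toy stand-in for one off-the-shelf aligned policy: next-token
    logits depend only on the previous token."""
    W = rng.normal(size=(VOCAB, VOCAB))
    return lambda prefix: W[prefix[-1] if prefix else 0]

def toy_utility(prefix, token):
    """Hypothetical stand-in for the long-term utility / target reward."""
    return float(np.sin(0.7 * token + 0.1 * len(prefix)))

agents = [make_agent() for _ in range(3)]
prefix = []
for step in range(6):
    scored = []
    for i, logits in enumerate(agents):
        p = np.exp(logits(prefix) - logits(prefix).max())
        p /= p.sum()
        tok = int(p.argmax())                      # each agent's greedy token
        scored.append((np.log(p[tok]) + toy_utility(prefix, tok), i, tok))
    _, agent_id, tok = max(scored)                 # switch to the best agent
    prefix.append(tok)
    print(f"step {step}: agent {agent_id} emits token {tok}")
```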

IROS Conference 2025 Conference Paper

Confidence-Controlled Exploration: Efficient Sparse-Reward Policy Learning for Robot Navigation

  • Bhrij Patel
  • Kasun Weerakoon
  • Wesley A. Suttle
  • Alec Koppel
  • Brian M. Sadler
  • Tianyi Zhou 0001
  • Dinesh Manocha
  • Amrit Singh Bedi

Reinforcement learning (RL) is a promising approach for robotic navigation, allowing robots to learn through trial and error. However, real-world robotic tasks often suffer from sparse rewards, leading to inefficient exploration and suboptimal policies due to the sample inefficiency of RL. In this work, we introduce Confidence-Controlled Exploration (CCE), a novel method that improves sample efficiency in RL-based robotic navigation without modifying the reward function. Unlike existing approaches, such as entropy regularization and reward shaping, which can introduce instability by altering rewards, CCE dynamically adjusts trajectory length based on policy entropy. Specifically, it shortens trajectories when uncertainty is high to enhance exploration and extends them when confidence is high to prioritize exploitation. CCE is a principled and practical solution inspired by a theoretical connection between policy entropy and gradient estimation. It integrates seamlessly with on-policy and off-policy RL methods and requires minimal modifications. We validate CCE across REINFORCE, PPO, and SAC in both simulated and real-world navigation tasks. CCE outperforms fixed-trajectory and entropy-regularized baselines, achieving an 18% higher success rate, 20-38% shorter paths, and 9.32% lower elevation costs under a fixed training sample budget. Finally, we deploy CCE on a Clearpath Husky robot, demonstrating its effectiveness in complex outdoor environments.
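
A minimal sketch of the core scheduling idea, assuming a simple linear map from policy entropy to rollout length (the paper's actual schedule may differ): high-entropy, uncertain policies get short trajectories, while low-entropy, confident policies get long ones.

```python
import numpy as np

def policy_entropy(probs):
    """Shannon entropy of the current policy's action distribution."""
    p = np.clip(probs, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def cce_trajectory_length(probs, k_min=16, k_max=256):
    """Map policy entropy to a rollout length: uncertain policy -> short
    trajectories (more exploration episodes), confident policy -> long
    trajectories (exploitation). The linear schedule is an assumption."""
    h = policy_entropy(probs)
    h_max = np.log(len(probs))      # maximum entropy for this action set
    confidence = 1.0 - h / h_max    # 0 = uniform policy, 1 = deterministic
    return int(k_min + confidence * (k_max - k_min))

print(cce_trajectory_length(np.ones(4) / 4))                       # -> 16
print(cce_trajectory_length(np.array([0.97, 0.01, 0.01, 0.01])))   # -> near 256
```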

AAAI Conference 2025 Conference Paper

Decentralized Convergence to Equilibrium Prices in Trading Networks

  • Edwin Lock
  • Benjamin Patrick Evans
  • Eleonora Kreacic
  • Sujay Bhatt
  • Alec Koppel
  • Sumitra Ganesh
  • Paul W. Goldberg

We propose a decentralized market model in which agents can negotiate bilateral contracts. This builds on a similar, but centralized, model of trading networks introduced by Hatfield et al. in 2013. Prior work has established that fully-substitutable preferences guarantee the existence of competitive equilibria which can be centrally computed. Our motivation comes from the fact that prices in markets such as over-the-counter markets and used car markets arise from decentralized negotiation among agents, which has left open an important question as to whether equilibrium prices can emerge from agent-to-agent bilateral negotiations. We design a best response dynamic intended to capture such negotiations between market participants. We assume fully substitutable preferences for market participants. In this setting, we provide proofs of convergence for sparse markets (covering many real world markets of interest), and experimental results for more general cases, demonstrating that prices indeed reach equilibrium, quickly, via bilateral negotiations. Our best response dynamic, and its convergence behavior, forms an important first step in understanding how decentralized markets reach, and retain, equilibrium.

ICLR Conference 2025 Conference Paper

GenARM: Reward Guided Generation with Autoregressive Reward Model for Test-Time Alignment

  • Yuancheng Xu
  • Udari Madhushani
  • Alec Koppel
  • Sicheng Zhu
  • Bang An 0001
  • Furong Huang
  • Sumitra Ganesh

Large Language Models (LLMs) exhibit impressive capabilities but require careful alignment with human preferences. Traditional training-time methods finetune LLMs using human preference datasets but incur significant training costs and require repeated training to handle diverse user preferences. Test-time alignment methods address this by using reward models (RMs) to guide frozen LLMs without retraining. However, existing test-time approaches rely on trajectory-level RMs which are designed to evaluate complete responses, making them unsuitable for autoregressive text generation that requires computing next-token rewards from partial responses. To address this, we introduce GenARM, a test-time alignment approach that leverages the Autoregressive Reward Model—a novel reward parametrization designed to predict next-token rewards for efficient and effective autoregressive generation. Theoretically, we demonstrate that this parametrization can provably guide frozen LLMs toward any distribution achievable by traditional RMs within the KL-regularized reinforcement learning framework. Experimental results show that GenARM significantly outperforms prior test-time alignment baselines and matches the performance of training-time methods. Additionally, GenARM enables efficient weak-to-strong guidance, aligning larger LLMs with smaller RMs without the high costs of training larger models. Furthermore, GenARM supports multi-objective alignment, allowing real-time trade-offs between preference dimensions and catering to diverse user preferences without retraining. Our project page is available at: https://genarm.github.io.
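
Within the KL-regularized framework the abstract cites, reward-guided decoding takes a simple form: the next-token distribution is the base model's distribution reweighted by exponentiated per-token rewards. The sketch below illustrates one decoding step with toy logits standing in for both the frozen LLM and the autoregressive reward model; `beta` plays the role of the KL-regularization strength.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def genarm_step(base_logits, reward_logits, beta=1.0):
    """One guided decoding step: p(a) proportional to
    p_base(a) * exp(r(a) / beta). Smaller beta weights the reward
    more heavily relative to the frozen base model."""
    return softmax(base_logits + reward_logits / beta)

base_logits = rng.normal(size=16)    # frozen LLM next-token logits (toy)
reward_logits = rng.normal(size=16)  # autoregressive RM token rewards (toy)
probs = genarm_step(base_logits, reward_logits, beta=0.5)
next_token = rng.choice(len(probs), p=probs)
print(next_token, probs[next_token])
```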

NeurIPS Conference 2025 Conference Paper

Learning in Stackelberg Mean Field Games: A Non-Asymptotic Analysis

  • Sihan Zeng
  • Benjamin Patrick Evans
  • Sujay Bhatt
  • Leo Ardon
  • Sumitra Ganesh
  • Alec Koppel

We study policy optimization in Stackelberg mean field games (MFGs), a hierarchical framework for modeling the strategic interaction between a single leader and an infinitely large population of homogeneous followers. The objective can be formulated as a structured bi-level optimization problem, in which the leader needs to learn a policy maximizing its reward, anticipating the response of the followers. Existing methods for solving these (and related) problems often rely on restrictive independence assumptions between the leader’s and followers’ objectives, use samples inefficiently due to nested-loop algorithm structure, and lack finite-time convergence guarantees. To address these limitations, we propose AC-SMFG, a single-loop actor-critic algorithm that operates on continuously generated Markovian samples. The algorithm alternates between (semi-)gradient updates for the leader, a representative follower, and the mean field, and is simple to implement in practice. We establish the finite-time and finite-sample convergence of the algorithm to a stationary point of the Stackelberg objective. To our knowledge, this is the first Stackelberg MFG algorithm with non-asymptotic convergence guarantees. Our key assumption is a "gradient alignment" condition, which requires that the full policy gradient of the leader can be approximated by a partial component of it, relaxing the existing leader-follower independence assumption. Simulation results in a range of well-established economics environments demonstrate that AC-SMFG outperforms existing multi-agent and MFG learning baselines in policy quality and convergence speed.

TMLR Journal 2024 Journal Article

Byzantine-Resilient Decentralized Multi-Armed Bandits

  • Jingxuan Zhu
  • Alec Koppel
  • Alvaro Velasquez
  • Ji Liu

In decentralized cooperative multi-armed bandits (MAB), each agent observes a distinct stream of rewards, and seeks to exchange information with others to select a sequence of arms so as to minimize its regret. Agents in the cooperative setting can outperform a single agent running a MAB method such as Upper-Confidence Bound (UCB) independently. In this work, we study how to recover such salient behavior when an unknown fraction of the agents can be Byzantine, that is, communicate arbitrarily wrong information in the form of reward mean-estimates or confidence sets. This framework can be used to model attackers in computer networks, instigators of offensive content into recommender systems, or manipulators of financial markets. Our key contribution is the development of a fully decentralized resilient upper confidence bound (UCB) algorithm that fuses an information mixing step among agents with a truncation of inconsistent and extreme values. This truncation step enables us to establish that the performance of each normal agent is no worse than the classic single-agent UCB1 algorithm in terms of regret, and more importantly, the cumulative regret of all normal agents is strictly better than the non-cooperative case, provided that each agent has at least $3f+1$ neighbors, where $f$ is the maximum possible number of Byzantine agents in each agent's neighborhood. Extensions to time-varying neighbor graphs and minimax lower bounds on the achievable regret are further established. Experiments corroborate the merits of this framework in practice.
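
A minimal sketch of the truncation-plus-fusion step, assuming a symmetric trimmed mean as the robust aggregator (the paper's exact truncation rule may differ): with at least $3f+1$ reports, discarding the $f$ highest and $f$ lowest caps the damage any $f$ Byzantine neighbors can do to the fused mean that feeds the UCB index.

```python
import numpy as np

def trimmed_mean(values, f):
    """Discard the f smallest and f largest reports before averaging;
    this bounds the influence of up to f arbitrarily wrong (Byzantine)
    reports in the neighborhood."""
    v = np.sort(np.asarray(values))
    return v[f:len(v) - f].mean()

# One round of fused UCB for a single arm (toy numbers).
own_estimate = 0.52
neighbor_reports = [0.50, 0.49, 0.55, 99.0]   # the last report is Byzantine
f = 1
fused = trimmed_mean([own_estimate] + neighbor_reports, f)
pulls, t = 40, 200
ucb_index = fused + np.sqrt(2 * np.log(t) / pulls)
print(f"fused mean {fused:.3f}, UCB index {ucb_index:.3f}")
```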

ICLR Conference 2024 Conference Paper

Efficient Inverse Multiagent Learning

  • Denizalp Goktas
  • Amy Greenwald
  • Sadie Zhao
  • Alec Koppel
  • Sumitra Ganesh

In this paper, we study inverse game theory (resp. inverse multiagent learning), in which the goal is to find parameters of a game’s payoff functions for which the expected (resp. sampled) behavior is an equilibrium. We formulate these problems as generative-adversarial (i.e., min-max) optimization problems, which we develop polynomial-time algorithms to solve, the former of which relies on an exact first-order oracle, and the latter, a stochastic one. We extend our approach to solve inverse multiagent simulacral learning in polynomial time and with a polynomial number of samples. In these problems, we seek a simulacrum, meaning parameters and an associated equilibrium that replicate the given observations in expectation. We find that our approach outperforms the widely-used ARIMA method in predicting prices in Spanish electricity markets based on time-series data.

ICML Conference 2024 Conference Paper

Information-Directed Pessimism for Offline Reinforcement Learning

  • Alec Koppel
  • Sujay Bhatt
  • Jiacheng Guo
  • Joe Eappen
  • Mengdi Wang 0001
  • Sumitra Ganesh

Policy optimization from batch data, i.e., offline reinforcement learning (RL), is important when collecting data from a current policy is not possible. This setting incurs distribution mismatch between the batch training data and trajectories from the current policy. Pessimistic offsets estimate this mismatch using concentration bounds, which possess strong theoretical guarantees and simplicity of implementation. These offsets may be overly conservative in sparse data regions and less so otherwise, which can cause them to under-perform their no-penalty variants in practice. We derive a new pessimistic penalty as the distance between the data and the true distribution using an evaluable one-sample test known as the Stein Discrepancy that requires minimal smoothness conditions, and notably, allows a mixture family representation of the distribution over next states. This entity forms a quantifier of information in offline data, which justifies calling this approach information-directed pessimism (IDP) for offline RL. We further establish that this new penalty based on the discrete Stein discrepancy yields practical gains in performance while generalizing the regret of prior art to multimodal distributions.

ICML Conference 2024 Conference Paper

MaxMin-RLHF: Alignment with Diverse Human Preferences

  • Souradip Chakraborty
  • Jiahao Qiu
  • Hui Yuan 0002
  • Alec Koppel
  • Dinesh Manocha
  • Furong Huang
  • Amrit Singh Bedi
  • Mengdi Wang 0001

Reinforcement Learning from Human Feedback (RLHF) aligns language models to human preferences by employing a singular reward model derived from preference data. However, the single reward model overlooks the rich diversity of human preferences inherent in data collected from multiple users. In this work, we first derive an impossibility result of alignment with single reward RLHF, thereby highlighting its insufficiency in representing diverse human preferences. Next, we propose to learn a mixture of reward models via an expectation-maximization algorithm and solve a MaxMin alignment objective inspired by the Egalitarian principle in social choice theory to better honor diverse human preferences. We present comprehensive experimental results on small-scale (GPT-2) and large-scale (Tulu2-7B) language models and show the efficacy of the proposed approach in the presence of diversity among human preferences. We remark that our findings in this work are not limited to language models but also extend to reinforcement learning in general.

JMLR Journal 2024 Journal Article

On the Sample Complexity and Metastability of Heavy-tailed Policy Search in Continuous Control

  • Amrit Singh Bedi
  • Anjaly Parayil
  • Junyu Zhang
  • Mengdi Wang
  • Alec Koppel

Reinforcement learning is a framework for interactive decision-making with incentives sequentially revealed across time without a system dynamics model. Due to its scaling to continuous spaces, we focus on policy search where one iteratively improves a parameterized policy with stochastic policy gradient (PG) updates. In tabular Markov Decision Problems (MDPs), under persistent exploration and suitable parameterization, global optimality may be obtained. By contrast, in continuous space, the non-convexity poses a pathological challenge as evidenced by existing convergence results being mostly limited to stationarity or arbitrary local extrema. To close this gap, we step towards persistent exploration in continuous space through policy parameterizations defined by distributions with heavier tails, governed by a tail-index parameter $\alpha$, which increases the likelihood of jumping in state space. Doing so invalidates smoothness conditions of the score function common to PG. Thus, we establish how the convergence rate to stationarity depends on the policy's tail index $\alpha$, a Hölder continuity parameter, integrability conditions, and an exploration tolerance parameter introduced here for the first time. Further, we characterize the dependence of the set of local maxima on the tail index through an exit and transition time analysis of a suitably defined Markov chain, identifying that policies associated with Lévy Processes of a heavier tail converge to wider peaks. This phenomenon yields improved stability to perturbations in supervised learning, which we corroborate also manifests in improved performance of policy search, especially when myopic and farsighted incentives are misaligned.
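
The key structural point, that heavier tails yield a bounded score function, can be seen directly by comparing a Gaussian policy with a Student-t one, using the degrees of freedom as a stand-in for the tail index $\alpha$ (an illustrative simplification, not the paper's exact parameterization):

```python
import numpy as np

def gaussian_score_mu(x, mu, sigma):
    """d/dmu log N(x; mu, sigma^2) -- grows without bound in |x - mu|."""
    return (x - mu) / sigma**2

def student_t_score_mu(x, mu, sigma, nu):
    """d/dmu of the Student-t log-density with nu degrees of freedom
    (a proxy for the tail index): bounded in |x - mu|, the property the
    heavy-tailed analysis exploits."""
    return (nu + 1) * (x - mu) / (nu * sigma**2 + (x - mu) ** 2)

xs = np.array([0.1, 1.0, 10.0, 100.0])
print(gaussian_score_mu(xs, 0.0, 1.0))            # 0.1, 1, 10, 100: unbounded
print(student_t_score_mu(xs, 0.0, 1.0, nu=2.0))   # peaks, then decays to 0
```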

ICLR Conference 2024 Conference Paper

PARL: A Unified Framework for Policy Alignment in Reinforcement Learning from Human Feedback

  • Souradip Chakraborty
  • Amrit Singh Bedi
  • Alec Koppel
  • Huazheng Wang
  • Dinesh Manocha
  • Mengdi Wang 0001
  • Furong Huang

We present a novel unified bilevel optimization-based framework, PARL, formulated to address the recently highlighted critical issue of policy alignment in reinforcement learning using utility- or preference-based feedback. We identify a major gap within current algorithmic designs for solving policy alignment, due to a lack of precise characterization of the dependence of the alignment objective on the data generated by policy trajectories. This shortfall contributes to the sub-optimal performance observed in contemporary algorithms. Our framework addresses these concerns by explicitly parameterizing the distribution of the upper alignment objective (reward design) by the lower optimal variable (the optimal policy for the designed reward). Interestingly, from an optimization perspective, our formulation leads to a new class of stochastic bilevel problems where the stochasticity at the upper objective depends upon the lower-level variable. To the best of our knowledge, this work presents the first formulation of RLHF as a bilevel optimization problem, which generalizes existing RLHF formulations and addresses their distribution shift issues. To demonstrate the efficacy of our formulation in resolving alignment issues in RL, we devised an algorithm named A-PARL to solve the PARL problem, establishing sample complexity bounds of order $\mathcal{O}(1/T)$. Our empirical results substantiate that the proposed PARL can address the alignment concerns in RL by showing significant improvements (up to 63% in terms of required samples) for policy alignment in large-scale environments of the Deepmind control suite and Meta world tasks.

TMLR Journal 2024 Journal Article

Regularized Proportional Fairness Mechanism for Resource Allocation Without Money

  • Sihan Zeng
  • Sujay Bhatt
  • Alec Koppel
  • Sumitra Ganesh

Mechanism design in resource allocation studies dividing limited resources among self-interested agents whose satisfaction with the allocation depends on privately held utilities. We consider the problem in a payment-free setting, with the aim of maximizing social welfare while enforcing incentive compatibility (IC), i.e., agents cannot inflate allocations by misreporting their utilities. The well-known proportional fairness (PF) mechanism achieves the maximum possible social welfare but incurs an undesirably high exploitability (the maximum unilateral utility gain from misreporting, a measure of deviation from IC). In fact, it is known that no mechanism can achieve the maximum social welfare and exact incentive compatibility simultaneously without the use of monetary incentives (Cole et al., 2013). Motivated by this fact, we propose learning an approximate mechanism that desirably trades off the competing objectives. Our main contribution is to design an innovative neural network architecture tailored to the resource allocation problem, which we name Regularized Proportional Fairness Network (RPF-Net). RPF-Net regularizes the output of the PF mechanism by a learned function approximator of the most exploitable allocation, with the aim of reducing the incentive for any agent to misreport. We derive generalization bounds that guarantee the mechanism performance when trained under finite and out-of-distribution samples and experimentally demonstrate the merits of the proposed mechanism compared to the state-of-the-art. The PF mechanism acts as an important benchmark for comparing the social welfare of any mechanism. However, there exists no established way of computing its exploitability. The challenge here is that we need to find the maximizer of an optimization problem for which the gradient is only implicitly defined. We provide, for the first time, a systematic method for finding such (sub)gradients, which enables the evaluation of the exploitability of the PF mechanism through iterative (sub)gradient ascent.

ICML Conference 2024 Conference Paper

Towards Global Optimality for Practical Average Reward Reinforcement Learning without Mixing Time Oracles

  • Bhrij Patel
  • Wesley A. Suttle
  • Alec Koppel
  • Vaneet Aggarwal
  • Brian M. Sadler
  • Dinesh Manocha
  • Amrit Singh Bedi

In the context of average-reward reinforcement learning, the requirement for oracle knowledge of the mixing time, a measure of the duration a Markov chain under a fixed policy needs to achieve its stationary distribution, poses a significant challenge for the global convergence of policy gradient methods. This requirement is particularly problematic due to the difficulty and expense of estimating mixing time in environments with large state spaces, leading to the necessity of impractically long trajectories for effective gradient estimation in practical applications. To address this limitation, we consider the Multi-level Actor-Critic (MAC) framework, which incorporates a Multi-level Monte-Carlo (MLMC) gradient estimator. With our approach, we effectively alleviate the dependency on mixing time knowledge, a first for global convergence in average-reward MDPs. Furthermore, our approach exhibits the tightest available mixing-time dependence of $\mathcal{O}(\sqrt{\tau_{mix}})$ relative to prior work. With a 2D grid-world goal-reaching navigation experiment, we demonstrate that MAC outperforms the existing state-of-the-art policy gradient-based method for average-reward settings.
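
A minimal sketch of the MLMC gradient estimator the MAC framework relies on: draw a random level $J$ with $P(J=j) = 2^{-j}$ and correct a cheap one-sample estimate with a scaled difference of two batch averages. In expectation this matches the largest-batch average while keeping the expected cost small, which is what removes the need to pick batch sizes from an unknown mixing time. The toy gradient oracle below is hypothetical, and the real estimator couples the levels along a single trajectory rather than drawing them independently.

```python
import numpy as np

rng = np.random.default_rng(3)

def grad_avg(n):
    """Average of n one-sample gradient estimates (toy: noisy copies of 1.0)."""
    return (1.0 + rng.normal(size=n)).mean()

def mlmc_gradient(j_max=6):
    """Multi-level Monte Carlo estimator: draw J with P(J=j) = 2^-j, then
    return g_1 + 2^J * (g_{2^J} - g_{2^{J-1}}).  Unbiased for the
    2^j_max-sample average, at a small expected sample cost."""
    J = rng.geometric(0.5)
    g = grad_avg(1)
    if J <= j_max:
        g += 2**J * (grad_avg(2**J) - grad_avg(2 ** (J - 1)))
    return g

est = np.mean([mlmc_gradient() for _ in range(4000)])
print(f"MLMC estimate of the gradient: {est:.3f}")   # close to 1.0
```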

JAIR Journal 2023 Journal Article

Achieving Zero Constraint Violation for Concave Utility Constrained Reinforcement Learning via Primal-Dual Approach

  • Qinbo Bai
  • Amrit Singh Bedi
  • Mridul Agarwal
  • Alec Koppel
  • Vaneet Aggarwal

Reinforcement learning (RL) is widely used in applications where one needs to perform sequential decision-making while interacting with the environment. The standard RL problem with safety constraints is generally mathematically modeled by constrained Markov Decision Processes (CMDP), which are linear in both the objective and the constraints in occupancy measure space; the problem becomes challenging when the model is unknown a priori. The problem becomes further challenging when the decision requirement includes optimizing a concave utility while satisfying some nonlinear safety constraints. To solve such a nonlinear problem, we propose a conservative stochastic primal-dual algorithm (CSPDA) via a randomized primal-dual approach. By leveraging a generative model, we prove that CSPDA not only exhibits Õ(1/ε²) sample complexity, but also achieves zero constraint violation for the concave utility CMDP. Compared with previous works, the best available sample complexity for CMDP with zero constraint violation is Õ(1/ε⁵). Hence, the proposed algorithm provides a significant improvement over the state of the art.
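
The conservative primal-dual idea can be shown on a two-action toy problem: tighten the cost budget by a small slack so the converged policy satisfies the true constraint exactly, while a multiplier λ is updated by dual ascent. This is a hand-rolled illustration of the mechanism under made-up numbers, not the CSPDA algorithm itself.

```python
import numpy as np

# Toy constrained problem: two actions with (reward, cost); maximize expected
# reward while keeping expected cost <= 0.3, via a randomized primal-dual scheme.
rewards = np.array([1.0, 0.2])
costs = np.array([0.9, 0.1])
budget = 0.3
lam, eta = 0.0, 0.05
slack = 0.02   # conservative tightening that drives violation to zero

for t in range(2000):
    # Primal step: softmax best response to the Lagrangian r - lam * c.
    lagr = rewards - lam * costs
    p = np.exp(5.0 * (lagr - lagr.max()))
    p /= p.sum()
    # Dual step: gradient ascent on the (tightened) constraint violation.
    lam = max(0.0, lam + eta * (p @ costs - (budget - slack)))

print(f"policy {np.round(p, 3)}, expected cost {p @ costs:.3f} <= {budget}")
```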

ICML Conference 2023 Conference Paper

Beyond Exponentially Fast Mixing in Average-Reward Reinforcement Learning via Multi-Level Monte Carlo Actor-Critic

  • Wesley A. Suttle
  • Amrit Singh Bedi
  • Bhrij Patel
  • Brian M. Sadler
  • Alec Koppel
  • Dinesh Manocha

Many existing reinforcement learning (RL) methods employ stochastic gradient iteration on the back end, whose stability hinges upon a hypothesis that the data-generating process mixes exponentially fast with a rate parameter that appears in the step-size selection. Unfortunately, this assumption is violated for large state spaces or settings with sparse rewards, and the mixing time is unknown, making the step size inoperable. In this work, we propose an RL methodology attuned to the mixing time by employing a multi-level Monte Carlo estimator for the critic, the actor, and the average reward embedded within an actor-critic (AC) algorithm. This method, which we call Multi-level Actor-Critic (MAC), is developed specifically for infinite-horizon average-reward settings and neither relies on oracle knowledge of the mixing time in its parameter selection nor assumes its exponential decay; it is therefore readily applicable to applications with slower mixing times. Nonetheless, it achieves a convergence rate comparable to SOTA actor-critic algorithms. We experimentally show that these alleviated restrictions on the technical conditions required for stability translate to superior performance in practice for RL problems with sparse rewards.

ICRA Conference 2023 Conference Paper

Dealing with Sparse Rewards in Continuous Control Robotics via Heavy-Tailed Policy Optimization

  • Souradip Chakraborty
  • Amrit Singh Bedi
  • Kasun Weerakoon
  • Prithvi Poddar
  • Alec Koppel
  • Pratap Tokekar
  • Dinesh Manocha

In this paper, we present a novel Heavy-Tailed Stochastic Policy Gradient (HT-SPG) algorithm to deal with the challenges of sparse rewards in continuous control problems. Sparse rewards are common in continuous control robotics tasks such as manipulation and navigation and make the learning problem hard due to the non-trivial estimation of value functions over the state space. This demands either reward shaping or expert demonstrations for the sparse reward environment. However, obtaining high-quality demonstrations is quite expensive and sometimes even impossible. We propose a heavy-tailed policy parametrization along with a modified momentum-based policy gradient tracking scheme (HT-SPG) to induce a stable exploratory behavior in the algorithm. The proposed algorithm does not require access to expert demonstrations. We test the performance of HT-SPG on various benchmark tasks of continuous control with sparse rewards such as 1D Mario, Pathological Mountain Car, Sparse Pendulum in OpenAI Gym, and Sparse MuJoCo environments (Hopper-v2, Half-Cheetah, Walker-2D). We show consistent performance improvement across all tasks in terms of high average cumulative reward without requiring access to expert demonstrations. We further demonstrate that a navigation policy trained using HT-SPG can be easily transferred into a Clearpath Husky robot to perform real-world navigation tasks.

ICRA Conference 2023 Conference Paper

Decentralized Multi-agent Exploration with Limited Inter-agent Communications

  • Hans He
  • Alec Koppel
  • Amrit Singh Bedi
  • Daniel J. Stilwell
  • Mazen Farhood
  • Benjamin Biggs

We consider the problem of decentralized multiagent environmental learning through maximizing the joint information gain among a team of agents. Inspired by subsea applications where bandwidth is severely limited, we explicitly consider the challenge of restricted communication between agents. The environment is modeled as a Gaussian process (GP), and the global information gain maximization problem in a GP is a set-valued optimization problem involving all agents' locally acquired data. We develop a decentralized method to solve it based on decomposition of information gain and exchange of limited subsets of data between agents. A key technical novelty of our approach is that we formulate the incentives for information exchange among agents as a submodular set optimization problem in terms of the log-determinant of their local covariance matrices. Numerical experiments on real-world data demonstrate the ability of our algorithm to explore the trade-off between objectives. In particular, we demonstrate favorable performance on mapping problems where both decentralized information gathering and limited information exchange are essential.
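
A small sketch of the submodular selection step, assuming an RBF-kernel GP covariance (the measurement locations and budget are made up): greedily keep the subset of measurements whose kernel submatrix has the largest log-determinant, the information-gain surrogate named in the abstract.

```python
import numpy as np

rng = np.random.default_rng(4)

def greedy_logdet(K, budget):
    """Greedily pick `budget` indices maximizing log det of the covariance
    submatrix -- a submodular information-gain surrogate for deciding which
    local measurements are worth transmitting."""
    chosen = []
    for _ in range(budget):
        best, best_gain = None, -np.inf
        for i in range(len(K)):
            if i in chosen:
                continue
            idx = chosen + [i]
            gain = np.linalg.slogdet(K[np.ix_(idx, idx)])[1]
            if gain > best_gain:
                best, best_gain = i, gain
        chosen.append(best)
    return chosen

# Toy GP covariance over 6 measurement locations.
X = rng.uniform(0, 5, size=(6, 1))
K = np.exp(-0.5 * (X - X.T) ** 2) + 1e-6 * np.eye(6)
print(greedy_logdet(K, budget=3))
```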

AAAI Conference 2023 Conference Paper

Posterior Coreset Construction with Kernelized Stein Discrepancy for Model-Based Reinforcement Learning

  • Souradip Chakraborty
  • Amrit Singh Bedi
  • Pratap Tokekar
  • Alec Koppel
  • Brian Sadler
  • Furong Huang
  • Dinesh Manocha

Model-based approaches to reinforcement learning (MBRL) exhibit favorable performance in practice, but their theoretical guarantees in large spaces are mostly restricted to the setting where the transition model is Gaussian or Lipschitz, and demand a posterior estimate whose representational complexity grows unbounded with time. In this work, we develop a novel MBRL method (i) which relaxes the assumptions on the target transition model to belong to a generic family of mixture models; (ii) is applicable to large-scale training by incorporating a compression step such that the posterior estimate consists of a Bayesian coreset of only statistically significant past state-action pairs; and (iii) exhibits sublinear Bayesian regret. To achieve these results, we adopt an approach based upon Stein's method, which, under a smoothness condition on the constructed posterior and target, allows distributional distance to be evaluated in closed form as the kernelized Stein discrepancy (KSD). The aforementioned compression step is then computed in terms of greedily retaining only those samples which are more than a certain KSD away from the previous model estimate. Experimentally, we observe that this approach is competitive with several state-of-the-art RL methodologies, and can achieve up to a 50% reduction in wall-clock time in some continuous control environments.

NeurIPS Conference 2023 Conference Paper

Scalable Primal-Dual Actor-Critic Method for Safe Multi-Agent RL with General Utilities

  • Donghao Ying
  • Yunkai Zhang
  • Yuhao Ding
  • Alec Koppel
  • Javad Lavaei

We investigate safe multi-agent reinforcement learning, where agents seek to collectively maximize an aggregate sum of local objectives while satisfying their own safety constraints. The objective and constraints are described by general utilities, i.e., nonlinear functions of the long-term state-action occupancy measure, which encompass broader decision-making goals such as risk, exploration, or imitation. The exponential growth of the state-action space size with the number of agents presents challenges for global observability, further exacerbated by the global coupling arising from agents' safety constraints. To tackle this issue, we propose a primal-dual method utilizing shadow reward and $\kappa$-hop neighbor truncation under a form of correlation decay property, where $\kappa$ is the communication radius. In the exact setting, our algorithm converges to a first-order stationary point (FOSP) at the rate of $\mathcal{O}\left(T^{-2/3}\right)$. In the sample-based setting, we demonstrate that, with high probability, our algorithm requires $\widetilde{\mathcal{O}}\left(\epsilon^{-3.5}\right)$ samples to achieve an $\epsilon$-FOSP with an approximation error of $\mathcal{O}(\phi_0^{2\kappa})$, where $\phi_0\in (0, 1)$. Finally, we demonstrate the effectiveness of our model through extensive numerical experiments.

ICML Conference 2023 Conference Paper

STEERING: Stein Information Directed Exploration for Model-Based Reinforcement Learning

  • Souradip Chakraborty
  • Amrit Singh Bedi
  • Alec Koppel
  • Mengdi Wang 0001
  • Furong Huang
  • Dinesh Manocha

Directed Exploration is a crucial challenge in reinforcement learning (RL), especially when rewards are sparse. Information-directed sampling (IDS), which optimizes the information ratio, seeks to do so by augmenting regret with information gain. However, estimating information gain is computationally intractable or relies on restrictive assumptions which prohibit its use in many practical instances. In this work, we posit an alternative exploration incentive in terms of the integral probability metric (IPM) between a current estimate of the transition model and the unknown optimal, which under suitable conditions, can be computed in closed form with the kernelized Stein discrepancy (KSD). Based on KSD, we develop a novel algorithm STEERING: STEin information dirEcted exploration for model-based Reinforcement LearnING. To enable its derivation, we develop fundamentally new variants of KSD for discrete conditional distributions. We further establish that STEERING achieves sublinear Bayesian regret, improving upon prior learning rates of information-augmented MBRL, IDS included. Experimentally, we show that the proposed algorithm is computationally affordable and outperforms several prior approaches.
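
Since the kernelized Stein discrepancy may be unfamiliar, here is a minimal V-statistic KSD estimate for a 1-D Gaussian target with an RBF kernel (the paper's novel variants are for discrete conditional distributions, so this is an illustrative simplification). The point is that the discrepancy needs only the target's score function and samples, with no normalizing constant or intractable information-gain estimate:

```python
import numpy as np

rng = np.random.default_rng(5)

def ksd_gaussian_target(x, mu=0.0, sigma=1.0, h=1.0):
    """V-statistic kernelized Stein discrepancy between samples x and a 1-D
    Gaussian target, using an RBF kernel with bandwidth h."""
    s = -(x - mu) / sigma**2                 # target score at the samples
    d = x[:, None] - x[None, :]
    k = np.exp(-d**2 / (2 * h**2))
    dk_dx = -d / h**2 * k                    # d/dx  k(x, x')
    dk_dy = d / h**2 * k                     # d/dx' k(x, x')
    d2k = (1 / h**2 - d**2 / h**4) * k       # d^2/(dx dx') k(x, x')
    u = (s[:, None] * s[None, :] * k + s[:, None] * dk_dy
         + s[None, :] * dk_dx + d2k)         # Stein kernel u_p(x, x')
    return u.mean()

print(ksd_gaussian_target(rng.normal(size=300)))             # ~0: matches target
print(ksd_gaussian_target(rng.normal(3.0, 1.0, size=300)))   # large: mismatch
```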

AAAI Conference 2022 Conference Paper

Achieving Zero Constraint Violation for Constrained Reinforcement Learning via Primal-Dual Approach

  • Qinbo Bai
  • Amrit Singh Bedi
  • Mridul Agarwal
  • Alec Koppel
  • Vaneet Aggarwal

Reinforcement learning is widely used in applications where one needs to perform sequential decisions while interacting with the environment. The problem becomes more challenging when the decision requirement includes satisfying some safety constraints. The problem is mathematically formulated as a constrained Markov decision process (CMDP). In the literature, various algorithms are available to solve CMDP problems in a model-free manner to achieve ε-optimal cumulative reward with ε-feasible policies. An ε-feasible policy implies that it suffers from constraint violation. An important question here is whether we can achieve ε-optimal cumulative reward with zero constraint violation or not. To achieve that, we advocate the use of a randomized primal-dual approach to solve the CMDP problems and propose a conservative stochastic primal-dual algorithm (CSPDA) which is shown to exhibit Õ(1/ε²) sample complexity to achieve ε-optimal cumulative reward with zero constraint violation. In prior works, the best available sample complexity for the ε-optimal policy with zero constraint violation is Õ(1/ε⁵). Hence, the proposed algorithm provides a significant improvement as compared to the state of the art.

IROS Conference 2022 Conference Paper

Distributed Riemannian Optimization with Lazy Communication for Collaborative Geometric Estimation

  • Yulun Tian
  • Amrit Singh Bedi
  • Alec Koppel
  • Miguel Calvo-Fullana
  • David M. Rosen
  • Jonathan P. How

We present the first distributed optimization algorithm with lazy communication for collaborative geometric estimation, the backbone of modern collaborative simultaneous localization and mapping (SLAM) and structure-from-motion (SfM) applications. Our method allows agents to cooperatively reconstruct a shared geometric model on a central server by fusing individual observations, but without the need to transmit potentially sensitive information about the agents themselves (such as their locations). Furthermore, to alleviate the burden of communication during iterative optimization, we design a set of communication triggering conditions that enable agents to selectively upload a targeted subset of local information that is useful to global optimization. Our approach thus achieves significant communication reduction with minimal impact on optimization performance. As our main theoretical contribution, we prove that our method converges to first-order critical points with a global sublinear convergence rate. Numerical evaluations on bundle adjustment problems from collaborative SLAM and SfM datasets show that our method performs competitively against existing distributed techniques, while achieving up to 78% total communication reduction.
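
A minimal sketch of an event-triggered ("lazy") communication rule of the kind the abstract describes, with a hypothetical drift threshold `tol`; the paper's actual triggering conditions are tailored to Riemannian optimization.

```python
import numpy as np

def should_upload(x_local, x_last_sent, tol):
    """Event-triggered communication: an agent uploads its local variable
    to the server only when it has drifted far enough from the last
    transmitted copy; otherwise the server keeps using the stale value."""
    return np.linalg.norm(x_local - x_last_sent) > tol

x_sent = np.zeros(3)
for step, x in enumerate(np.cumsum(np.full((6, 3), 0.05), axis=0)):
    if should_upload(x, x_sent, tol=0.2):
        x_sent = x.copy()
        print(f"step {step}: upload")
    else:
        print(f"step {step}: skip (saved bandwidth)")
```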

AAAI Conference 2022 Conference Paper

Multi-Agent Reinforcement Learning with General Utilities via Decentralized Shadow Reward Actor-Critic

  • Junyu Zhang
  • Amrit Singh Bedi
  • Mengdi Wang
  • Alec Koppel

We posit a new mechanism for cooperation in multi-agent reinforcement learning (MARL) based upon any nonlinear function of the team’s long-term state-action occupancy measure, i.e., a general utility. This subsumes the cumulative return but also allows one to incorporate risk-sensitivity, exploration, and priors. We derive the Decentralized Shadow Reward Actor-Critic (DSAC) in which agents alternate between policy evaluation (critic), weighted averaging with neighbors (information mixing), and local gradient updates for their policy parameters (actor). DSAC augments the classic critic step by requiring agents to (i) estimate their local occupancy measure in order to (ii) estimate the derivative of the local utility with respect to their occupancy measure, i.e., the “shadow reward”. DSAC converges to a stationary point at a sublinear rate with high probability, depending on the amount of communications. Under proper conditions, we further establish the non-existence of spurious stationary points for this problem, that is, DSAC finds the globally optimal policy. Experiments demonstrate the merits of goals beyond the cumulative return in cooperative MARL.

ICML Conference 2022 Conference Paper

On the Hidden Biases of Policy Mirror Ascent in Continuous Action Spaces

  • Amrit Singh Bedi
  • Souradip Chakraborty
  • Anjaly Parayil
  • Brian M. Sadler
  • Pratap Tokekar
  • Alec Koppel

We focus on parameterized policy search for reinforcement learning over continuous action spaces. Typically, one assumes the score function associated with a policy is bounded, which fails to hold even for Gaussian policies. To properly address this issue, one must introduce an exploration tolerance parameter to quantify the region in which it is bounded. Doing so incurs a persistent bias that appears in the attenuation rate of the expected policy gradient norm, which is inversely proportional to the radius of the action space. To mitigate this hidden bias, heavy-tailed policy parameterizations may be used, which exhibit a bounded score function, but doing so can cause instability in algorithmic updates. To address these issues, in this work, we study the convergence of policy gradient algorithms under heavy-tailed parameterizations, which we propose to stabilize with a combination of mirror ascent-type updates and gradient tracking. Our main theoretical contribution is the establishment that this scheme converges with constant batch sizes, whereas prior works require these parameters to respectively shrink to null or grow to infinity. Experimentally, this scheme under a heavy-tailed policy parameterization yields improved reward accumulation across a variety of settings as compared with standard benchmarks.

ICML Conference 2022 Conference Paper

Sharpened Quasi-Newton Methods: Faster Superlinear Rate and Larger Local Convergence Neighborhood

  • Qiujiang Jin
  • Alec Koppel
  • Ketan Rajawat
  • Aryan Mokhtari

Non-asymptotic analysis of quasi-Newton methods has received a lot of attention recently. In particular, several works have established a non-asymptotic superlinear rate of $\mathcal{O}((1/\sqrt{t})^t)$ for the (classic) BFGS method by exploiting the fact that its error of Newton direction approximation approaches zero. Moreover, a greedy variant of the BFGS method was recently proposed which accelerates the convergence of BFGS by directly approximating the Hessian matrix, instead of the Newton direction, and achieves a fast local quadratic convergence rate. Alas, the local quadratic convergence of Greedy-BFGS requires many more updates compared to the number of iterations that BFGS requires for a local superlinear rate. This is due to the fact that in Greedy-BFGS the Hessian is directly approximated and the Newton direction approximation may not be as accurate as the one for BFGS. In this paper, we close this gap and present a novel BFGS method that has the best of both worlds. More precisely, it leverages the approximation ideas of both BFGS and Greedy-BFGS to properly approximate both the Newton direction and the Hessian matrix. Our theoretical results show that our method outperforms both BFGS and Greedy-BFGS in terms of convergence rate, while it reaches its quadratic convergence rate with fewer steps compared to Greedy-BFGS. Numerical experiments on various datasets also confirm our theoretical findings.
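
For readers unfamiliar with the objects being compared, the classic BFGS update that all of these methods refine is a rank-two correction built from one secant pair; the sketch below applies it repeatedly on a quadratic, where the exact secant pairs $y = As$ pull the approximation toward the true Hessian (a plain-BFGS illustration, not the sharpened method from the paper).

```python
import numpy as np

def bfgs_update(B, s, y):
    """Classic BFGS update of the Hessian approximation B from the iterate
    displacement s = x_new - x and gradient displacement y = g_new - g."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)

# Quadratic sanity check: for f(x) = 0.5 x'Ax every secant pair satisfies
# y = As, and repeated updates drive B toward A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
B = np.eye(2)
rng = np.random.default_rng(6)
for _ in range(50):
    s = rng.normal(size=2)
    B = bfgs_update(B, s, A @ s)
print(np.round(B, 3))   # approaches A
```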

IROS Conference 2021 Conference Paper

Wasserstein-Splitting Gaussian Process Regression for Heterogeneous Online Bayesian Inference

  • Michael E. Kepler
  • Alec Koppel
  • Amrit Singh Bedi
  • Daniel J. Stilwell

Gaussian processes (GPs) are a well-known nonparametric Bayesian inference technique, but they suffer from scalability problems for large sample sizes, and their performance can degrade for non-stationary or spatially heterogeneous data. In this work, we seek to overcome these issues through (i) employing variational free energy approximations of GPs operating in tandem with online expectation propagation steps; and (ii) introducing a local splitting step which instantiates a new GP whenever the posterior distribution changes significantly, as quantified by the Wasserstein metric over posterior distributions. Over time, then, this yields an ensemble of sparse GPs which may be updated incrementally, and adapts to locality, heterogeneity, and non-stationarity in training data. We provide a 1-dimensional example to illustrate the motivation behind our approach, and compare the performance of our approach to other Gaussian process methods across various data sets; it often achieves competitive, if not superior, predictive performance relative to other locality-based GP regression methods in which hyperparameters are learned in an online manner.
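
The splitting trigger reduces to a closed-form test because the 2-Wasserstein distance between Gaussians is analytic. A scalar-posterior sketch follows (a simplifying assumption relative to the paper's full GP posteriors, and `eps` is a hypothetical threshold):

```python
import numpy as np

def w2_gaussian_1d(mu1, sigma1, mu2, sigma2):
    """Closed-form 2-Wasserstein distance between 1-D Gaussians."""
    return np.sqrt((mu1 - mu2) ** 2 + (sigma1 - sigma2) ** 2)

def should_split(prev_post, new_post, eps=0.5):
    """Instantiate a fresh local GP when the posterior has moved too far
    in Wasserstein distance -- the splitting trigger from the abstract,
    shown here for scalar (mean, std) posteriors."""
    return w2_gaussian_1d(*prev_post, *new_post) > eps

print(should_split((0.0, 1.0), (0.1, 1.05)))   # False: keep updating same GP
print(should_split((0.0, 1.0), (1.2, 0.4)))    # True: spawn a new local GP
```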

JMLR Journal 2020 Journal Article

A Class of Parallel Doubly Stochastic Algorithms for Large-Scale Learning

  • Aryan Mokhtari
  • Alec Koppel
  • Martin Takac
  • Alejandro Ribeiro

We consider learning problems over training sets in which both the number of training examples and the dimension of the feature vectors are large. To solve these problems we propose the random parallel stochastic algorithm (RAPSA). We call the algorithm random parallel because it utilizes multiple parallel processors to operate on a randomly chosen subset of blocks of the feature vector. RAPSA is doubly stochastic since each processor utilizes a random set of functions to compute the stochastic gradient associated with a randomly chosen set of variable coordinates. Algorithms that are parallel in either of these dimensions exist, but RAPSA is the first attempt at a methodology that is parallel in both the selection of blocks and the selection of elements of the training set. In RAPSA, processors utilize the randomly chosen functions to compute the stochastic gradient component associated with a randomly chosen block. The technical contribution of this paper is to show that this minimally coordinated algorithm converges to the optimal classifier when the training objective is strongly convex. Moreover, we present an accelerated version of RAPSA (ARAPSA) that incorporates the objective function curvature information by premultiplying the descent direction by a Hessian approximation matrix. We further extend the results for asynchronous settings and show that if the processors perform their updates without any coordination the algorithms are still convergent to the optimal argument. RAPSA and its extensions are then numerically evaluated on a linear estimation problem and a binary image classification task using the MNIST handwritten digit dataset.
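
A compact sketch of the doubly stochastic pattern on a least-squares problem: every iteration draws both a random minibatch of training examples and a random block of coordinates, and only that block is updated. This simulates the parallel processors sequentially and omits the asynchrony and the ARAPSA curvature correction.

```python
import numpy as np

rng = np.random.default_rng(7)

# Least-squares objective f(x) = 0.5 ||Ax - b||^2, n samples, d coordinates.
n, d = 200, 10
A = rng.normal(size=(n, d))
x_true = rng.normal(size=d)
b = A @ x_true
x = np.zeros(d)

for t in range(3000):
    # Doubly stochastic step: draw a random minibatch of samples AND a
    # random block of coordinates; update only that block.
    samples = rng.choice(n, size=10, replace=False)
    block = rng.choice(d, size=3, replace=False)
    residual = A[samples] @ x - b[samples]
    grad_block = A[np.ix_(samples, block)].T @ residual / len(samples)
    x[block] -= 0.05 * grad_block

print(f"error {np.linalg.norm(x - x_true):.4f}")   # near zero
```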

IROS Conference 2020 Conference Paper

Dense Incremental Metric-Semantic Mapping via Sparse Gaussian Process Regression

  • Ehsan Zobeidi
  • Alec Koppel
  • Nikolay Atanasov 0001

We develop an online probabilistic metric-semantic mapping approach for autonomous robots relying on streaming RGB-D observations. We cast this problem as a Bayesian inference task, requiring encoding both the geometric surfaces and semantic labels (e.g., chair, table, wall) of the unknown environment. We propose an online Gaussian Process (GP) training and inference approach, which avoids the complexity of GP classification by regressing a truncated signed distance function representation of the regions occupied by different semantic classes. Online regression is enabled through sparse GP approximation, compressing the training data to a finite set of inducing points, and through spatial domain partitioning into an Octree data structure with overlapping leaves. Our experiments demonstrate the effectiveness of this technique for large-scale probabilistic metric-semantic mapping of 3D environments. A distinguishing feature of our approach is that the generated maps contain full continuous distributional information about the geometric surfaces and semantic labels, making them appropriate for uncertainty-aware planning.

NeurIPS Conference 2020 Conference Paper

Variational Policy Gradient Method for Reinforcement Learning with General Utilities

  • Junyu Zhang
  • Alec Koppel
  • Amrit Singh Bedi
  • Csaba Szepesvari
  • Mengdi Wang

In recent years, reinforcement learning systems with general goals beyond a cumulative sum of rewards have gained traction, such as in constrained problems, exploration, and acting upon prior experiences. In this paper, we consider policy optimization in Markov Decision Problems, where the objective is a general utility function of the state-action occupancy measure, which subsumes several of the aforementioned examples as special cases. Such generality invalidates the Bellman equation. As this means that dynamic programming no longer works, we focus on direct policy search. Analogously to the Policy Gradient Theorem (Sutton et al., 2000) available for RL with cumulative rewards, we derive a new Variational Policy Gradient Theorem for RL with general utilities, which establishes that the gradient may be obtained as the solution of a stochastic saddle point problem involving the Fenchel dual of the utility function. We develop a variational Monte Carlo gradient estimation algorithm to compute the policy gradient based on sample paths. Further, we prove that the variational policy gradient scheme converges globally to the optimal policy for the general objective, and we also establish its rate of convergence that matches or improves the convergence rate available in the case of RL with cumulative rewards.
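
The objects in play are easy to exhibit on a tiny MDP: the discounted state occupancy measure solves a linear system, and a "general utility" is any function of it, such as its entropy, for which no Bellman equation exists. A minimal sketch (the transition matrix and policy below are made up):

```python
import numpy as np

# A 3-state MDP under a fixed policy: P_pi[s, s'] = transition probabilities.
P_pi = np.array([[0.7, 0.3, 0.0],
                 [0.1, 0.6, 0.3],
                 [0.2, 0.0, 0.8]])
mu0 = np.array([1.0, 0.0, 0.0])   # initial state distribution
gamma = 0.9

# Discounted state occupancy: lam = (1 - gamma) * mu0 + gamma * P_pi^T lam.
lam = np.linalg.solve(np.eye(3) - gamma * P_pi.T, (1 - gamma) * mu0)

# A 'general utility' is any function of the occupancy measure -- e.g. its
# entropy (an exploration objective), which admits no Bellman equation.
entropy = -(lam * np.log(lam)).sum()
print(np.round(lam, 4), f"entropy {entropy:.4f}")
```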

JMLR Journal 2019 Journal Article

Parsimonious Online Learning with Kernels via Sparse Projections in Function Space

  • Alec Koppel
  • Garrett Warnell
  • Ethan Stump
  • Alejandro Ribeiro

Despite their attractiveness, popular perception is that techniques for nonparametric function approximation do not scale to streaming data due to an intractable growth in the amount of storage they require. To solve this problem in a memory-affordable way, we propose an online technique based on functional stochastic gradient descent in tandem with supervised sparsification based on greedy function subspace projections. The method, called parsimonious online learning with kernels (POLK), provides a controllable tradeoff between its solution accuracy and the amount of memory it requires. We derive conditions under which the generated function sequence converges almost surely to the optimal function, and we establish that the memory requirement remains finite. We evaluate POLK for kernel multi-class logistic regression and kernel hinge-loss classification on three canonical data sets: a synthetic Gaussian mixture model, the MNIST hand-written digits, and the Brodatz texture database. On all three tasks, we observe a favorable trade-off of objective function evaluation, classification performance, and complexity of the nonparametric regressor extracted by the proposed method.
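
A rough sketch of the sparsification idea, assuming an RBF kernel and a single destructive sweep (the paper's matching-pursuit-based projection is more refined): a dictionary point is dropped whenever re-fitting the remaining weights keeps the RKHS-norm approximation error below a budget `eps`, which is how the memory requirement stays finite.

```python
import numpy as np

rng = np.random.default_rng(8)

def rbf(X, Y, h=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * h**2))

def prune_dictionary(D, w, eps):
    """Simplified destructive sweep: drop dictionary point j whenever the
    best RKHS approximation on the remaining points stays within eps of
    the current expansion sum_i w_i k(d_i, .)."""
    keep = list(range(len(D)))
    for j in list(keep):
        rest = [i for i in keep if i != j]
        K_ff = rbf(D[keep], D[keep])
        K_rf = rbf(D[rest], D[keep])
        K_rr = rbf(D[rest], D[rest]) + 1e-8 * np.eye(len(rest))
        wk = w[keep]
        # Squared RKHS error of the best re-fit on the reduced dictionary.
        v = np.linalg.solve(K_rr, K_rf @ wk)
        err2 = wk @ K_ff @ wk - wk @ K_rf.T @ v
        if err2 < eps**2:
            keep = rest
    return keep

D = rng.normal(size=(12, 2))
D[6:] = D[:6] + 0.01            # near-duplicate points are redundant
w = rng.normal(size=12) * 0.3
print(f"kept {len(prune_dictionary(D, w, eps=0.05))} of {len(D)} points")
```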

IROS Conference 2018 Conference Paper

Composable Learning with Sparse Kernel Representations

  • Ekaterina I. Tolstaya
  • Ethan Stump
  • Alec Koppel
  • Alejandro Ribeiro

We present a reinforcement learning algorithm for learning sparse non-parametric controllers in a Reproducing Kernel Hilbert Space. We improve the sample complexity of this approach by imposing structure on the state-action value function through a normalized advantage function (NAF). This representation of the policy enables efficiently composing multiple learned models without additional training samples or interaction with the environment. We demonstrate the performance of this algorithm on learning obstacle-avoidance policies in multiple simulations of a robot equipped with a laser scanner while navigating in a 2D environment. We apply the composition operation to various policy combinations and test them to show that the composed policies retain the performance of their components. We also transfer the composed policy directly to a physical platform operating in an arena with obstacles in order to demonstrate a degree of generalization.

IROS Conference 2016 Conference Paper

Online learning for characterizing unknown environments in ground robotic vehicle models

  • Alec Koppel
  • Jonathan Fink
  • Garrett Warnell
  • Ethan Stump
  • Alejandro Ribeiro

In pursuit of increasing the operational tempo of a ground robotics platform in unknown domains, we consider the problem of predicting the distribution of structural state-estimation error due to poorly-modeled platform dynamics as well as environmental effects. Such predictions are a critical component of any modern control approach that utilizes uncertainty information to provide robustness in control design. We use an online learning algorithm based on matrix factorization techniques to fit a statistical model of error that provides enough expressive power to enable prediction directly from motion control signals and low-level visual features. Moreover, we empirically demonstrate that this technique compares favorably to predictors that do not incorporate this information.

IROS Conference 2015 Conference Paper

D4L: Decentralized dynamic discriminative dictionary learning

  • Alec Koppel
  • Garrett Warnell
  • Ethan Stump
  • Alejandro Ribeiro

We consider discriminative dictionary learning in a distributed online setting, where a team of networked robots aims to jointly learn both a common basis of the feature space and a classifier over this basis from sequentially observed signals. We formulate this problem as a distributed stochastic program with a non-convex objective and present a block variant of the Arrow-Hurwicz saddle point algorithm to solve it. Only neighboring nodes in the communications network need to exchange information, and we penalize the discrepancy between the individual feature bases and classifiers using Lagrange multipliers. The application we consider is for a team of robots to collaboratively recognize objects of interest in dynamic environments. As a preliminary performance benchmark, we consider the problem of learning a texture classifier across a network of robots moving around an urban setting where separate training examples are sequentially observed at each robot. Results are shown for both a standard texture dataset and a new dataset from an urban training facility, and we compare the performance of the standard centralized construction to the new distributed algorithm for the case when distinct samples from all classes are seen by the robots. These experiments yield comparable performance between the decentralized and the centralized cases, demonstrating the proposed method's practical utility.