Arrow Research search

Author name cluster

Sujay Bhatt

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

10 papers
2 author rows

Possible papers (10)

RLC Conference 2025 Conference Paper

Building Sequential Resource Allocation Mechanisms without Payments

  • Sihan Zeng
  • Sujay Bhatt
  • Alec Koppel
  • Sumitra Ganesh

We study allocating divisible resources of limited quantities to agents who submit requests for the resources one or multiple times over a finite horizon. This is referred to as the sequential or online resource allocation problem, as irrevocable allocations need to be made as the requests arrive, without observations of future requests. Existing work on sequential resource allocation (in the payment-free setting) mainly focuses on optimizing social welfare and designs mechanisms under the assumption that agents make truthful requests. Such mechanisms are easily exploitable -- strategic agents may misreport their requests to inflate their allocations. Our aim in this work is to design sequential resource allocation mechanisms that balance the competing objectives of social welfare maximization (promoting overall agent satisfaction) and incentive compatibility (ensuring that agents have no incentive to misreport). We do not design these mechanisms from scratch. Instead, since incentive-compatible mechanism design has been well studied in the one-shot setting (horizon length equal to one), we propose a general meta-algorithm for transforming a one-shot mechanism into its sequential counterpart. The meta-algorithm can plug in any one-shot mechanism and approximately carry over the properties that the one-shot mechanism already satisfies to the sequential setting. We establish theoretical results validating these claims and illustrate superior performance relative to baselines in experiments.
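A minimal sketch of the abstract's core idea, wrapping a one-shot mechanism so it can be applied round by round against the remaining supply. All names here (`one_shot_proportional`, `SequentialWrapper`) are invented for illustration and are not the paper's actual meta-algorithm:

```python
# Hypothetical sketch: turn any payment-free one-shot mechanism into a
# sequential one by re-running it each round on the remaining supply.

def one_shot_proportional(requests, supply):
    """Toy one-shot mechanism: split supply in proportion to reported requests."""
    total = sum(requests)
    if total == 0:
        return [0.0] * len(requests)
    return [supply * r / total for r in requests]

class SequentialWrapper:
    """Apply a one-shot mechanism each round against the remaining supply."""
    def __init__(self, one_shot_mechanism, supply):
        self.mechanism = one_shot_mechanism
        self.remaining = supply

    def allocate(self, requests):
        alloc = self.mechanism(requests, self.remaining)
        # Never hand an agent more than it asked for.
        alloc = [min(a, r) for a, r in zip(alloc, requests)]
        self.remaining -= sum(alloc)
        return alloc

seq = SequentialWrapper(one_shot_proportional, supply=10.0)
round1 = seq.allocate([4.0, 4.0])  # equal requests -> equal split, capped at 4 each
round2 = seq.allocate([6.0, 2.0])  # only 2.0 units remain, split 6:2
```

The wrapper inherits whatever per-round properties the plugged-in mechanism has; the paper's contribution is showing such properties carry over approximately across the horizon.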

RLJ Journal 2025 Journal Article

Building Sequential Resource Allocation Mechanisms without Payments

  • Sihan Zeng
  • Sujay Bhatt
  • Alec Koppel
  • Sumitra Ganesh

We study allocating divisible resources of limited quantities to agents who submit requests for the resources one or multiple times over a finite horizon. This is referred to as the sequential or online resource allocation problem, as irrevocable allocations need to be made as the requests arrive, without observations of future requests. Existing work on sequential resource allocation (in the payment-free setting) mainly focuses on optimizing social welfare and designs mechanisms under the assumption that agents make truthful requests. Such mechanisms are easily exploitable -- strategic agents may misreport their requests to inflate their allocations. Our aim in this work is to design sequential resource allocation mechanisms that balance the competing objectives of social welfare maximization (promoting overall agent satisfaction) and incentive compatibility (ensuring that agents have no incentive to misreport). We do not design these mechanisms from scratch. Instead, since incentive-compatible mechanism design has been well studied in the one-shot setting (horizon length equal to one), we propose a general meta-algorithm for transforming a one-shot mechanism into its sequential counterpart. The meta-algorithm can plug in any one-shot mechanism and approximately carry over the properties that the one-shot mechanism already satisfies to the sequential setting. We establish theoretical results validating these claims and illustrate superior performance relative to baselines in experiments.

ICLR Conference 2025 Conference Paper

Collab: Controlled Decoding using Mixture of Agents for LLM Alignment

  • Souradip Chakraborty
  • Sujay Bhatt
  • Udari Madhushani
  • Soumya Suvra Ghosal
  • Jiahao Qiu
  • Mengdi Wang 0001
  • Dinesh Manocha
  • Furong Huang

Alignment of Large Language Models (LLMs) is crucial for safe and trustworthy deployment in applications. Reinforcement learning from human feedback (RLHF) has emerged as an effective technique for aligning LLMs to human preferences and broader utilities, but it requires updating billions of model parameters, which is computationally expensive. Controlled decoding, by contrast, provides a mechanism for aligning a model at inference time without retraining. However, single-agent decoding approaches often struggle to adapt to diverse tasks due to the complexity and variability inherent in these tasks. To strengthen test-time performance with respect to the target task, we propose a mixture-of-agents decoding strategy leveraging existing off-the-shelf aligned LLM policies. Treating each prior policy as an agent in the spirit of mixture-of-agents collaboration, we develop a decoding method that allows for inference-time alignment through a token-level selection strategy among multiple agents. For each token, the most suitable LLM is dynamically chosen from a pool of models based on a long-term utility metric. This policy-switching mechanism ensures optimal model selection at each step, enabling efficient collaboration and alignment among LLMs during decoding. Theoretical analysis of the proposed algorithm establishes optimal performance with respect to the target task, represented via a target reward, for the given off-the-shelf models. We conduct comprehensive empirical evaluations with open-source aligned models on diverse tasks and preferences, which demonstrate the merits of this approach over single-agent decoding baselines. Notably, COLLAB surpasses the current state-of-the-art decoding strategy, achieving an improvement of up to 1.56x in average reward and 71.89% in GPT-4-based win-tie rate.
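The token-level policy switching described above can be sketched abstractly: at each step every agent proposes a token and the proposal scoring highest under a utility function wins. Real LLM policies are replaced here by toy callables, and both the agents and the utility are invented for illustration:

```python
# Illustrative policy-switching decoder: per token, keep the proposal with the
# highest utility among several "agents". Not the paper's actual algorithm.

def switch_decode(agents, utility, prompt, max_tokens):
    """Greedy token-level selection among agents by a utility score."""
    context = list(prompt)
    for _ in range(max_tokens):
        proposals = [agent(context) for agent in agents]
        best = max(proposals, key=lambda tok: utility(context, tok))
        context.append(best)
    return context

# Toy agents: one always proposes "a", the other always "b".
agent_a = lambda ctx: "a"
agent_b = lambda ctx: "b"
# Toy utility: reward tokens that differ from the previous one.
utility = lambda ctx, tok: 1.0 if (not ctx or ctx[-1] != tok) else 0.0

out = switch_decode([agent_a, agent_b], utility, prompt=["a"], max_tokens=3)
# The alternation-preferring utility yields ["a", "b", "a", "b"].
```

In the paper the utility is a long-term (value-style) metric under a target reward, which is what makes the per-token switching optimal with respect to the target task.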

AAAI Conference 2025 Conference Paper

Decentralized Convergence to Equilibrium Prices in Trading Networks

  • Edwin Lock
  • Benjamin Patrick Evans
  • Eleonora Kreacic
  • Sujay Bhatt
  • Alec Koppel
  • Sumitra Ganesh
  • Paul W. Goldberg

We propose a decentralized market model in which agents can negotiate bilateral contracts. This builds on a similar, but centralized, model of trading networks introduced by Hatfield et al. in 2013. Prior work has established that fully substitutable preferences guarantee the existence of competitive equilibria, which can be centrally computed. Our motivation comes from the fact that prices in markets such as over-the-counter markets and used car markets arise from decentralized negotiation among agents, which leaves open the important question of whether equilibrium prices can emerge from agent-to-agent bilateral negotiations. We design a best response dynamic intended to capture such negotiations between market participants, assuming fully substitutable preferences. In this setting, we provide proofs of convergence for sparse markets (covering many real-world markets of interest) and experimental results for more general cases, demonstrating that prices indeed reach equilibrium quickly via bilateral negotiations. Our best response dynamic, and its convergence behavior, forms an important first step in understanding how decentralized markets reach, and retain, equilibrium.
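A two-agent toy in the spirit of the bilateral dynamics above: a buyer and a seller repeatedly best-respond by conceding on price while the concession remains profitable, until offer and ask cross. The paper's dynamic operates over full trading networks; the response rules and names here are invented for illustration only:

```python
# Toy bilateral price negotiation via alternating concessions (hypothetical
# response rules, not the paper's best response dynamic).

def negotiate(buyer_value, seller_cost, step=0.1, max_rounds=10000):
    """Buyer raises its offer while profitable; seller lowers its ask while
    profitable; a trade clears once the offer meets the ask."""
    offer, ask = 0.0, buyer_value + seller_cost
    for _ in range(max_rounds):
        if offer >= ask:
            return (offer + ask) / 2  # trade at the crossing price
        moved = False
        if offer + step <= buyer_value:   # buyer can still concede profitably
            offer += step
            moved = True
        if ask - step >= seller_cost:     # seller can still concede profitably
            ask -= step
            moved = True
        if not moved:
            break  # neither side can concede further; no trade
    return (offer + ask) / 2 if offer >= ask else None

price = negotiate(buyer_value=1.0, seller_cost=0.2)
# With gains from trade (value > cost), the dynamic clears at a price
# between the seller's cost and the buyer's value.
```

The interesting content of the paper is that analogous agent-to-agent dynamics converge to competitive equilibrium prices in sparse trading networks, not just in this one-pair case.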

NeurIPS Conference 2025 Conference Paper

Learning in Stackelberg Mean Field Games: A Non-Asymptotic Analysis

  • Sihan Zeng
  • Benjamin Patrick Evans
  • Sujay Bhatt
  • Leo Ardon
  • Sumitra Ganesh
  • Alec Koppel

We study policy optimization in Stackelberg mean field games (MFGs), a hierarchical framework for modeling the strategic interaction between a single leader and an infinitely large population of homogeneous followers. The objective can be formulated as a structured bi-level optimization problem, in which the leader needs to learn a policy maximizing its reward, anticipating the response of the followers. Existing methods for solving these (and related) problems often rely on restrictive independence assumptions between the leader’s and followers’ objectives, use samples inefficiently due to nested-loop algorithm structure, and lack finite-time convergence guarantees. To address these limitations, we propose AC-SMFG, a single-loop actor-critic algorithm that operates on continuously generated Markovian samples. The algorithm alternates between (semi-)gradient updates for the leader, a representative follower, and the mean field, and is simple to implement in practice. We establish the finite-time and finite-sample convergence of the algorithm to a stationary point of the Stackelberg objective. To our knowledge, this is the first Stackelberg MFG algorithm with non-asymptotic convergence guarantees. Our key assumption is a "gradient alignment" condition, which requires that the full policy gradient of the leader can be approximated by a partial component of it, relaxing the existing leader-follower independence assumption. Simulation results in a range of well-established economics environments demonstrate that AC-SMFG outperforms existing multi-agent and MFG learning baselines in policy quality and convergence speed.
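The single-loop alternating structure described above (leader, representative follower, and mean field all updated within one loop) can be mimicked on a toy quadratic game. The game, step sizes, and update rules below are invented purely to show the loop shape; they are not AC-SMFG and carry none of its guarantees:

```python
# Toy single-loop alternating updates: follower descends its cost, the mean
# field tracks the follower, and the leader ascends its reward -- all in one
# loop, with no nested inner solve. Entirely illustrative.

def run(steps=5000, lr=0.01):
    a, b, m = 1.0, 0.0, 0.0   # leader action, follower action, mean field
    for _ in range(steps):
        b -= lr * 2.0 * (b - a)   # follower: gradient step on cost (b - a)^2
        m += lr * (b - m)         # mean field: track the follower population
        a -= lr * 2.0 * (a - m)   # leader: gradient ascent on reward -(a - m)^2
    return a, b, m

a, b, m = run()
# The three quantities settle into a mutually consistent fixed point
# (follower best-responds to the leader, mean field matches the follower).
```

The contrast the abstract draws is exactly this: one interleaved loop over continuously generated samples, rather than fully solving the follower/mean-field problem between leader updates.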

ICML Conference 2024 Conference Paper

Information-Directed Pessimism for Offline Reinforcement Learning

  • Alec Koppel
  • Sujay Bhatt
  • Jiacheng Guo
  • Joe Eappen
  • Mengdi Wang 0001
  • Sumitra Ganesh

Policy optimization from batch data, i.e., offline reinforcement learning (RL), is important when collecting data from a current policy is not possible. This setting incurs distribution mismatch between the batch training data and trajectories from the current policy. Pessimistic offsets estimate this mismatch using concentration bounds, which offer strong theoretical guarantees and are simple to implement. However, such penalties may be overly conservative in sparse data regions and less so otherwise, which can cause these methods to under-perform their no-penalty variants in practice. We derive a new pessimistic penalty as the distance between the data and the true distribution using an evaluable one-sample test known as the Stein discrepancy, which requires minimal smoothness conditions and, notably, allows a mixture-family representation of the distribution over next states. This quantity measures the information content of the offline data, which justifies calling this approach information-directed pessimism (IDP) for offline RL. We further establish that this new penalty, based on the discrete Stein discrepancy, yields practical gains in performance while generalizing the regret of prior art to multimodal distributions.

TMLR Journal 2024 Journal Article

Regularized Proportional Fairness Mechanism for Resource Allocation Without Money

  • Sihan Zeng
  • Sujay Bhatt
  • Alec Koppel
  • Sumitra Ganesh

Mechanism design in resource allocation studies dividing limited resources among self-interested agents whose satisfaction with the allocation depends on privately held utilities. We consider the problem in a payment-free setting, with the aim of maximizing social welfare while enforcing incentive compatibility (IC), i.e., agents cannot inflate allocations by misreporting their utilities. The well-known proportional fairness (PF) mechanism achieves the maximum possible social welfare but incurs an undesirably high exploitability (the maximum unilateral inflation in utility from misreporting, and a measure of deviation from IC). In fact, it is known that no mechanism can simultaneously achieve maximum social welfare and exact incentive compatibility without the use of monetary incentives (Cole et al., 2013). Motivated by this fact, we propose learning an approximate mechanism that desirably trades off the competing objectives. Our main contribution is an innovative neural network architecture tailored to the resource allocation problem, which we name Regularized Proportional Fairness Network (RPF-Net). RPF-Net regularizes the output of the PF mechanism by a learned function approximator of the most exploitable allocation, with the aim of reducing the incentive for any agent to misreport. We derive generalization bounds that guarantee the mechanism's performance when trained under finite and out-of-distribution samples, and experimentally demonstrate the merits of the proposed mechanism compared to the state of the art. The PF mechanism acts as an important benchmark for comparing the social welfare of any mechanism; however, there exists no established way of computing its exploitability. The challenge is that we need to find the maximizer of an optimization problem whose gradient is only implicitly defined. We provide, for the first time, a systematic method for finding such (sub)gradients, which enables the evaluation of the exploitability of the PF mechanism through iterative (sub)gradient ascent.
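The notion of exploitability used above (the maximum unilateral utility gain from misreporting) can be illustrated on a deliberately naive mechanism. The mechanism and the grid-search evaluation below are stand-ins, not RPF-Net or the paper's (sub)gradient method:

```python
# Toy exploitability check: how much can an agent gain by misreporting to a
# mechanism that allocates proportionally to reports? Purely illustrative.

def proportional(reports, supply=1.0):
    """Allocate supply in proportion to reported utilities."""
    total = sum(reports)
    return [supply * r / total for r in reports]

def exploitability(mechanism, true_utils, agent, grid):
    """Max utility gain the agent can obtain over a grid of misreports."""
    truthful = mechanism(true_utils)
    base = true_utils[agent] * truthful[agent]  # utility under truth-telling
    best = base
    for lie in grid:
        reports = list(true_utils)
        reports[agent] = lie
        alloc = mechanism(reports)
        best = max(best, true_utils[agent] * alloc[agent])
    return best - base

utils = [1.0, 1.0]
gap = exploitability(proportional, utils, agent=0, grid=[0.5, 2.0, 4.0])
# Reporting 4.0 instead of 1.0 raises agent 0's share from 1/2 to 4/5,
# so the exploitability gap on this grid is 0.3.
```

The paper's harder problem is doing this for the PF mechanism, where the inner maximizer's gradient is only implicitly defined and must be recovered as a (sub)gradient.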

ICML Conference 2022 Conference Paper

Minimax M-estimation under Adversarial Contamination

  • Sujay Bhatt
  • Guanhua Fang
  • Ping Li 0001
  • Gennady Samorodnitsky

We present a new finite-sample analysis of Catoni’s M-estimator under adversarial contamination, where an adversary is allowed to corrupt a fraction of the samples arbitrarily. We make minimal assumptions on the distribution of the uncontaminated random variables: we only assume the existence of a known upper bound $\upsilon_{\varepsilon} > 0$ on the $(1+\varepsilon)^{th}$ central moment of the random variables, i.e., for $\varepsilon \in (0, 1]$, \[ \mathbb{E}_{X_1 \sim \mathcal{D}} \Big| X_1 - \mu \Big|^{1+\varepsilon} \leq \upsilon_{\varepsilon}. \] We provide a lower bound on the minimax error rate for the mean estimation problem under adversarial corruption under this weak assumption, and establish that the proposed M-estimator achieves this lower bound (up to multiplicative constants). When the variance is infinite, the tolerance to contamination of any estimator reduces as $\varepsilon \downarrow 0$. We establish a tight upper bound that characterizes this trade-off. To illustrate the usefulness of the derived robust M-estimator in an online setting, we present a bandit algorithm for the partially identifiable best arm identification problem that improves upon the sample complexity of state-of-the-art algorithms.
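For context, the classical (finite-variance, uncontaminated) Catoni M-estimator that this line of work builds on solves $\sum_i \psi(\alpha(X_i - \theta)) = 0$ with $\psi(x) = \mathrm{sign}(x)\log(1 + |x| + x^2/2)$. A minimal sketch via bisection follows; the choice of `alpha` is a simple placeholder, not the tuned value from the analysis, and this is not the contamination-robust variant studied above:

```python
import math

# Sketch of Catoni's M-estimator for the mean (Catoni 2012 flavor): find the
# root of the monotone estimating equation by bisection.

def psi(x):
    """Catoni's influence function: odd, grows only logarithmically."""
    return math.copysign(math.log(1.0 + abs(x) + 0.5 * x * x), x)

def catoni_mean(xs, alpha=0.5, tol=1e-9):
    f = lambda theta: sum(psi(alpha * (x - theta)) for x in xs)
    lo, hi = min(xs), max(xs)  # the root lies within the sample range
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) > 0:         # f is decreasing in theta
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

data = [1.0, 2.0, 3.0, 4.0, 100.0]  # one gross outlier
est = catoni_mean(data)
# est is pulled less toward 100.0 than the sample mean (22.0), because psi
# dampens large deviations logarithmically.
```

The paper's contributions sit on top of this template: weakening the moment assumption to a $(1+\varepsilon)^{th}$ central moment and quantifying how much adversarial contamination the estimator can tolerate.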

ICML Conference 2022 Conference Paper

Nearly Optimal Catoni's M-estimator for Infinite Variance

  • Sujay Bhatt
  • Guanhua Fang
  • Ping Li 0001
  • Gennady Samorodnitsky

In this paper, we extend the remarkable M-estimator of Catoni \citep{Cat12} to situations where the variance is infinite. In particular, given a sequence of i.i.d. random variables $\{X_i\}_{i=1}^n$ from a distribution $\mathcal{D}$ over $\mathbb{R}$ with mean $\mu$, we only assume the existence of a known upper bound $\upsilon_{\varepsilon} > 0$ on the $(1+\varepsilon)^{th}$ central moment of the random variables, i.e., for $\varepsilon \in (0, 1]$, \[ \mathbb{E}_{X_1 \sim \mathcal{D}} \Big| X_1 - \mu \Big|^{1+\varepsilon} \leq \upsilon_{\varepsilon}. \] The extension is non-trivial owing to the difficulty of characterizing the roots of certain polynomials of degree smaller than $2$. The proposed estimator has the same order of magnitude, and the same asymptotic constant, as the bounded-moment case of \citet{Cat12}. We further propose a version of the estimator that does not even require knowledge of $\upsilon_{\varepsilon}$, but adapts to the moment bound in a data-driven manner. Finally, to illustrate the usefulness of the derived non-asymptotic confidence bounds, we consider an application in multi-armed bandits and propose best arm identification algorithms, in the fixed confidence setting, that outperform the state of the art.

UAI Conference 2022 Conference Paper

Offline change detection under contamination

  • Sujay Bhatt
  • Guanhua Fang
  • Ping Li 0001

In this work, we propose a non-parametric and robust change detection algorithm to detect multiple change points in time series data under non-adversarial contamination. The algorithm is designed for the offline setting, where the objective is to detect changes once all data have been received. We only make weak moment assumptions on the inliers (the uncorrupted data) to handle a large class of distributions. The robust scan statistic in the change detection algorithm is fashioned from mean estimators based on influence functions. We establish the consistency of the estimated change point indices as the number of samples increases, and provide empirical evidence supporting the consistency results.
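A toy offline scan in the spirit of this abstract: slide over candidate split points and score each by a size-weighted gap between robust location estimates of the two sides. The median stands in for the influence-function-based estimators used in the paper, and the single-change restriction and CUSUM-style weighting are simplifications for illustration:

```python
import statistics

# Toy robust offline change-point scan (single change point). The median is a
# stand-in for the paper's influence-function mean estimators.

def scan_change_point(series, min_seg=3):
    """Return the split index maximizing a size-weighted gap between the
    medians of the left and right segments."""
    n = len(series)
    best_idx, best_score = None, -1.0
    for t in range(min_seg, n - min_seg + 1):
        gap = abs(statistics.median(series[:t]) - statistics.median(series[t:]))
        score = (t * (n - t) / n) ** 0.5 * gap  # CUSUM-style size weighting
        if score > best_score:
            best_idx, best_score = t, score
    return best_idx

series = [0.0] * 10 + [5.0] * 10   # mean shifts from 0 to 5 at index 10
series[3] = 50.0                    # one contaminated sample before the change
cp = scan_change_point(series)
# The robust statistic ignores the outlier and recovers the true split at 10.
```

Using a robust location estimate is exactly what keeps the scan statistic from firing on the contaminated sample; with an ordinary mean, a single gross outlier could dominate the gap.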