Arrow Research

Author name cluster

Xuchuang Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

14 papers
2 author rows

Possible papers (14)

AAAI Conference 2026 Conference Paper

A Multi-Agent Conversational Bandit Approach to Online Evaluation and Selection of User-Aligned LLM Responses

  • Xiangxiang Dai
  • Yuejin Xie
  • Maoli Liu
  • Xuchuang Wang
  • Zhuohua Li
  • Huanyu Wang
  • John C.S. Lui

Prompt-based offline methods are commonly used to optimize large language model (LLM) responses, but evaluating these responses is computationally intensive and often fails to accommodate diverse response styles. This study introduces a novel online evaluation framework that employs a multi-agent conversational bandit model to select optimal responses while dynamically aligning with user preferences. To tackle challenges such as high-dimensional features, large response sets, adaptive conversational needs, and multi-device access, we propose MACO, Multi-Agent Conversational Online Learning, which comprises two key components: (1) MACO-A: executed by local agents, it employs an online elimination mechanism to filter out low-quality responses. (2) MACO-S: executed by the cloud server, it adaptively adjusts selection strategies based on aggregated preference data. An adaptive preference mechanism triggers asynchronous conversations to enhance alignment efficiency. Theoretical analysis demonstrates that MACO achieves near-optimal regret bounds, matching state-of-the-art performance in various degenerate cases. Extensive experiments using Google and OpenAI text embedding models on real-world datasets with different response styles, combined with Llama and GPT-4o, show that MACO consistently outperforms baseline methods by at least 8.29% across varying response set sizes and numbers of agents.
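
To make the local elimination idea concrete, here is a minimal, hypothetical sketch of confidence-bound elimination in a linear contextual bandit, in the spirit of MACO-A filtering low-quality responses. The single-agent simplification, dimension, constant confidence radius, and Gaussian features are all illustrative assumptions, not the paper's specification.

```python
import numpy as np

# Hedged sketch: ridge-regression estimate of user preferences; drop any
# response whose optimistic score falls below the best pessimistic score.
rng = np.random.default_rng(0)
d, n_responses, T = 8, 20, 500
theta_star = rng.normal(size=d) / np.sqrt(d)        # unknown user preference
X = rng.normal(size=(n_responses, d)) / np.sqrt(d)  # response embeddings (toy)

A = np.eye(d)            # ridge-regularized Gram matrix
b = np.zeros(d)
active = list(range(n_responses))
beta = 1.0               # confidence radius (assumed constant for brevity)

for t in range(T):
    theta_hat = np.linalg.solve(A, b)
    A_inv = np.linalg.inv(A)
    widths = {i: beta * np.sqrt(X[i] @ A_inv @ X[i]) for i in active}
    scores = {i: X[i] @ theta_hat for i in active}
    # Eliminate responses whose upper bound falls below the best lower bound.
    best_lcb = max(scores[i] - widths[i] for i in active)
    active = [i for i in active if scores[i] + widths[i] >= best_lcb]
    # Play the most uncertain surviving response, observe a noisy reward.
    i = max(active, key=lambda j: widths[j])
    r = X[i] @ theta_star + 0.1 * rng.normal()
    A += np.outer(X[i], X[i])
    b += r * X[i]

print(f"{len(active)} responses survive elimination after {T} rounds")
```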

NeurIPS Conference 2025 Conference Paper

Federated Multi-armed Bandits with Efficient Bit-Level Communications

  • Haoran Zhang
  • Yang Xu
  • Xuchuang Wang
  • Hao-Xu Chen
  • Hao Qiu
  • Lin Yang
  • Yang Gao

In this work, we study the federated multi-armed bandit (FMAB) problem, where a set of distributed agents collaboratively aim to minimize cumulative regret while interacting with a shared set of arms. Unlike traditional centralized bandit models, agents in FMAB settings are connected via a communication graph and cannot share data freely due to bandwidth limitations or privacy constraints. This raises a fundamental challenge: how to achieve optimal learning performance under stringent communication budgets. We propose a novel communication-efficient algorithm that decouples the learning process into two phases: one for eliminating suboptimal arms through early and frequent communication of key decisions, and another for refining global estimates using buffered, quantized, and differentially transmitted statistics. By carefully balancing the communication frequency and precision of shared information, our algorithm achieves the optimal individual regret bound $O(N^{-1}\log T)$ while significantly reducing the total number of communication rounds and transmitted bits. Theoretically, we derive tight upper bounds on both individual cumulative regret and group regret, and prove that our method asymptotically matches the lower bound of regret in federated settings. Experimental results on synthetic data validate the effectiveness of the proposed approach in various graph topologies and under heterogeneous feedback.
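
To make the bit-level idea concrete, here is a minimal sketch of quantizing a local empirical mean to a few bits before transmission. The fixed $[0,1]$ range, stochastic (unbiased) rounding, and bit width are illustrative assumptions, not the paper's actual encoding.

```python
import numpy as np

def quantize(mean: float, bits: int, rng) -> int:
    """Stochastically round a mean in [0, 1] onto a (2**bits - 1)-level grid."""
    levels = (1 << bits) - 1
    scaled = np.clip(mean, 0.0, 1.0) * levels
    low = int(np.floor(scaled))
    # Round up with probability equal to the fractional part (unbiased).
    return low + (rng.random() < scaled - low)

def dequantize(code: int, bits: int) -> float:
    return code / ((1 << bits) - 1)

rng = np.random.default_rng(1)
true_mean, n_samples, bits = 0.63, 1000, 4
local_mean = rng.binomial(1, true_mean, size=n_samples).mean()
code = quantize(local_mean, bits, rng)          # only `bits` bits on the wire
print(f"local={local_mean:.4f}  received={dequantize(code, bits):.4f}")
```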

ICML Conference 2025 Conference Paper

Fusing Reward and Dueling Feedback in Stochastic Bandits

  • Xuchuang Wang
  • Qirun Zeng
  • Jinhang Zuo
  • Xutong Liu 0002
  • Mohammad Hajiesmaili
  • John C. S. Lui
  • Adam Wierman

This paper investigates the fusion of absolute (reward) and relative (dueling) feedback in stochastic bandits, where both feedback types are gathered in each decision round. We derive a regret lower bound, demonstrating that an efficient algorithm may incur only the smaller among the reward-based and dueling-based regret for each individual arm. We propose two fusion approaches: (1) a simple elimination fusion algorithm that leverages both feedback types to explore all arms and unifies collected information by sharing a common candidate arm set, and (2) a decomposition fusion algorithm that selects the more effective feedback to explore the corresponding arms and randomly assigns one feedback type for exploration and the other for exploitation in each round. The elimination fusion algorithm incurs a suboptimal multiplicative factor of the number of arms in its regret, due to the intrinsic suboptimality of dueling elimination. In contrast, the decomposition fusion achieves regret matching the lower bound up to a constant under a common assumption. Extensive experiments confirm the efficacy of our algorithms and theoretical results.
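
The following is a toy sketch of the elimination-fusion idea: both feedback channels prune one shared candidate set. The arm means, the linear link for duel outcomes, and the confidence radii are assumptions for illustration, not the paper's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(2)
K, T = 6, 3000
mu = np.sort(rng.uniform(0.2, 0.8, size=K))[::-1]   # arm 0 is best

counts = np.zeros(K); sums = np.zeros(K)            # reward statistics
wins = np.zeros((K, K)); duels = np.zeros((K, K))   # dueling statistics
active = list(range(K))

for t in range(1, T + 1):
    # Reward feedback: pull the least-sampled active arm.
    i = min(active, key=lambda a: counts[a])
    sums[i] += mu[i] + 0.1 * rng.normal(); counts[i] += 1
    # Dueling feedback: compare two random active arms.
    if len(active) >= 2:
        a, b = rng.choice(active, size=2, replace=False)
        win = rng.random() < 0.5 + (mu[a] - mu[b])   # linear link, assumed
        wins[a, b] += win; wins[b, a] += 1 - win
        duels[a, b] += 1; duels[b, a] += 1
    # Prune the shared candidate set with both feedback types.
    rad = np.sqrt(2 * np.log(T) / np.maximum(counts, 1))
    means = sums / np.maximum(counts, 1)
    best_lcb = max(means[a] - rad[a] for a in active)
    active = [a for a in active
              if means[a] + rad[a] >= best_lcb        # reward elimination
              and all(duels[a, b] == 0                # dueling elimination:
                      or wins[b, a] / duels[a, b]     # drop a if some b beats
                      - np.sqrt(2 * np.log(T) / duels[a, b]) <= 0.5
                      for b in active)]

print("surviving arms:", active)
```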

AAAI Conference 2025 Conference Paper

Heterogeneous Multi-Agent Bandits with Parsimonious Hints

  • Amirmahdi Mirfakhar
  • Xuchuang Wang
  • Jinhang Zuo
  • Yair Zick
  • Mohammad Hajiesmaili

We study a hinted heterogeneous multi-agent multi-armed bandits problem (HMA2B), where agents can query low-cost observations (hints) in addition to pulling arms. In this framework, each of the $M$ agents has a unique reward distribution over $K$ arms, and in $T$ rounds, they can observe the reward of the arm they pull only if no other agent pulls that arm. The goal is to maximize the total utility by querying the minimal necessary hints without pulling arms, achieving time-independent regret. We study HMA2B in both centralized and decentralized setups. Our main centralized algorithm, GP-HCLA, which is an extension of HCLA, uses a central decision-maker for arm-pulling and hint queries, achieving $O(M^4K)$ regret with $O(MK\log T)$ adaptive hints. In decentralized setups, we propose two algorithms, HD-ETC and EBHD-ETC, that allow agents to choose actions independently through collision-based communication and query hints uniformly until stopping, yielding $O(M^3K^2)$ regret with $O(M^3K\log T)$ hints, where the former requires knowledge of the minimum gap and the latter does not. Finally, we establish lower bounds to prove the optimality of our results and verify them through numerical simulations.
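
A toy "hint-then-commit" sketch of the centralized setup follows: a coordinator queries cheap hints (observations that are not pulls), then commits to a collision-free assignment of agents to arms. The uniform hint schedule, fixed hint budget, and brute-force commit step are simplifications for illustration, not GP-HCLA itself.

```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(3)
M, K, hint_rounds = 3, 5, 400
mu = rng.uniform(0, 1, size=(M, K))       # agent-specific arm means (unknown)

sums = np.zeros((M, K)); counts = np.zeros((M, K))
for _ in range(hint_rounds):              # uniform hint querying, no pulls yet
    for m in range(M):
        k = rng.integers(K)
        sums[m, k] += mu[m, k] + 0.1 * rng.normal(); counts[m, k] += 1

means = sums / np.maximum(counts, 1)
# Commit: best assignment of distinct arms to agents (brute force, small K).
best = max(permutations(range(K), M),
           key=lambda p: sum(means[m, p[m]] for m in range(M)))
opt = max(permutations(range(K), M),
          key=lambda p: sum(mu[m, p[m]] for m in range(M)))
print("committed assignment:", best, " optimal:", opt)
```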

UAI Conference 2025 Conference Paper

Near-Optimal Regret Bounds for Federated Multi-armed Bandits with Fully Distributed Communication

  • Haoran Zhang
  • Xuchuang Wang
  • Hao-Xu Chen
  • Hao Qiu
  • Lin Yang 0013
  • Yang Gao

In this paper, we study federated multi-armed bandit (FMAB) problems in which agents can only communicate with their neighbors. All agents aim to solve a common multi-armed bandit (MAB) problem so as to minimize their individual regrets, while keeping the group regret low as well. In a federated bandit problem, an agent cannot estimate the global reward means of arms from local observations alone, and hence bandit learning algorithms usually adopt a consensus estimation strategy to handle this heterogeneity. However, existing algorithms for fully distributed communication graphs have so far achieved only suboptimal results for this problem. To address this, we propose a fully distributed online consensus estimation algorithm (\texttt{CES}) that estimates the global mean without bias. Integrating this consensus estimator into a distributed successive elimination bandit algorithm framework yields our federated bandit algorithm. Our algorithm significantly improves both individual and group regrets over previous approaches, and we provide an in-depth analysis of the lower bound for this problem.
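
To illustrate the role of a consensus estimator (though not the exact form of \texttt{CES}), here is a minimal gossip-averaging demo: each agent repeatedly averages its estimate with its neighbors', which converges to the global mean of the initial local estimates. The ring topology and step count are assumptions for the demo.

```python
import numpy as np

N, steps = 8, 200
rng = np.random.default_rng(4)
local_means = rng.uniform(0, 1, size=N)        # each agent's local estimate

# Doubly stochastic mixing matrix for a ring: average self and two neighbors.
W = np.zeros((N, N))
for i in range(N):
    W[i, i] = W[i, (i - 1) % N] = W[i, (i + 1) % N] = 1 / 3

x = local_means.copy()
for _ in range(steps):
    x = W @ x                                  # one round of neighbor averaging

print(f"global mean {local_means.mean():.4f}, consensus value ~ {x[0]:.4f}")
```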

AAAI Conference 2025 Conference Paper

Quantum Best Arm Identification with Quantum Oracles

  • Xuchuang Wang
  • Yu-Zhen Janice Chen
  • Matheus Guedes de Andrade
  • Jonathan Allcock
  • Mohammad Hajiesmaili
  • John C.S. Lui
  • Don Towsley

Best arm identification (BAI) is a key problem in stochastic multi-armed bandits, where each of $K$ arms has an associated reward distribution, and the objective is to minimize the number of queries needed to identify the best arm with high confidence. In this paper, we explore BAI using quantum oracles. For the case where each query probes only one arm ($m=1$), we devise a quantum algorithm with a query complexity upper bound of $O((K/\Delta)\log(1/\delta))$, where $\delta$ is the confidence parameter and $\Delta$ is the reward gap between the best and second-best arms. This improves on the classical bound by a factor of $1/\Delta$. For the general case where a single query can probe $m$ arms ($1 \le m \le K$) simultaneously, we propose an algorithm with an upper bound of $O((K/(\Delta\sqrt{m}))\log(1/\delta))$, improving by a factor of $\sqrt{m}$ compared to the $m=1$ case. We also provide query complexity lower bounds for both scenarios, which match the upper bounds up to logarithmic factors, and validate our theoretical results with Qiskit-based simulations.
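
A short worked comparison makes the stated improvements explicit. The classical baseline below is assumed to be the standard fixed-confidence BAI complexity, which is consistent with the abstract's claim of a $1/\Delta$ speedup but is not stated there directly.

```latex
% Progression of query complexities implied by the abstract
% (classical baseline is an assumption; constants suppressed):
\underbrace{O\!\Big(\tfrac{K}{\Delta^{2}}\log\tfrac{1}{\delta}\Big)}_{\text{classical}}
\;\longrightarrow\;
\underbrace{O\!\Big(\tfrac{K}{\Delta}\log\tfrac{1}{\delta}\Big)}_{\text{quantum},\ m=1}
\;\longrightarrow\;
\underbrace{O\!\Big(\tfrac{K}{\Delta\sqrt{m}}\log\tfrac{1}{\delta}\Big)}_{\text{quantum},\ 1\le m\le K}
```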

ICLR Conference 2025 Conference Paper

Stochastic Bandits Robust to Adversarial Attacks

  • Xuchuang Wang
  • Maoli Liu
  • Jinhang Zuo
  • Xutong Liu 0002
  • John C. S. Lui
  • Mohammad Hajiesmaili

This paper investigates stochastic multi-armed bandit algorithms that are robust to adversarial attacks, where an attacker can first observe the learner's action and *then* alter their reward observation. We study two cases of this model, with or without knowledge of an attack budget $C$, defined as an upper bound on the sum of the differences between the actual and altered rewards. For both cases, we devise two types of algorithms with regret bounds having additive or multiplicative $C$ dependence terms. For the known attack budget case, we prove our algorithms achieve the regret bound of ${O}((K/\Delta)\log T + KC)$ and $\tilde{O}(\sqrt{KTC})$ for the additive and multiplicative $C$ terms, respectively, where $K$ is the number of arms, $T$ is the time horizon, $\Delta$ is the gap between the expected rewards of the optimal arm and the second-best arm, and $\tilde{O}$ hides the logarithmic factors. For the unknown case, we prove our algorithms achieve the regret bound of $\tilde{O}(\sqrt{KT} + KC^2)$ and $\tilde{O}(KC\sqrt{T})$ for the additive and multiplicative $C$ terms, respectively. In addition to these upper bound results, we provide several lower bounds showing the tightness of our bounds and the optimality of our algorithms. These results delineate an intrinsic separation between the bandits-with-attacks and corruption models.
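
For the known-budget case, one standard way to obtain an additive $KC$ term is to inflate each arm's UCB by $C/n_i$: an attacker with total budget $C$ can bias the empirical mean of an arm pulled $n_i$ times by at most $C/n_i$. The sketch below implements this inflated-UCB idea against a simple reward-flipping attacker; it is a hedged illustration of the principle, not necessarily the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(5)
K, T, C = 5, 5000, 30.0
mu = np.array([0.9, 0.7, 0.6, 0.5, 0.3])

counts = np.zeros(K); sums = np.zeros(K)
budget_left = C
for t in range(1, T + 1):
    if t <= K:
        i = t - 1                                    # pull each arm once
    else:
        means = sums / counts
        # Usual UCB width plus the corruption allowance C / n_i.
        ucb = means + np.sqrt(2 * np.log(t) / counts) + C / counts
        i = int(np.argmax(ucb))
    r = float(rng.random() < mu[i])                  # Bernoulli reward
    # Adversary: after seeing the action, push the best arm's reward down.
    if i == 0 and budget_left >= 1.0 and r == 1.0:
        r, budget_left = 0.0, budget_left - 1.0
    sums[i] += r; counts[i] += 1

print("pulls per arm:", counts.astype(int))
```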

ICML Conference 2024 Conference Paper

Combinatorial Multivariant Multi-Armed Bandits with Applications to Episodic Reinforcement Learning and Beyond

  • Xutong Liu 0002
  • Siwei Wang 0002
  • Jinhang Zuo
  • Han Zhong
  • Xuchuang Wang
  • Zhiyong Wang
  • Shuai Li 0010
  • Mohammad Hajiesmaili

We introduce a novel framework of combinatorial multi-armed bandits (CMAB) with multivariant and probabilistically triggering arms (CMAB-MT), where the outcome of each arm is a $d$-dimensional multivariant random variable and the feedback follows a general arm triggering process. Compared with existing CMAB works, CMAB-MT not only enhances the modeling power but also allows improved results by leveraging distinct statistical properties for multivariant random variables. For CMAB-MT, we propose a general 1-norm multivariant and triggering probability-modulated smoothness condition, and an optimistic CUCB-MT algorithm built upon this condition. Our framework can include many important problems as applications, such as episodic reinforcement learning (RL) and probabilistic maximum coverage for goods distribution, all of which meet the above smoothness condition and achieve matching or improved regret bounds compared to existing works. Through our new framework, we build the first connection between the episodic RL and CMAB literature, by offering a new angle to solve the episodic RL through the lens of CMAB, which may encourage more interactions between these two important directions.
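
As a rough illustration of the probabilistic maximum coverage application mentioned above, here is a generic CUCB-style loop (not the paper's CUCB-MT): maintain UCBs on unknown edge activation probabilities, let an oracle pick the seed set maximizing estimated coverage, and update only the edges triggered that round. The bipartite instance, brute-force oracle, and confidence constant are assumptions.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(10)
n_src, n_tgt, k, T = 5, 8, 2, 1500
p = rng.uniform(0.1, 0.9, size=(n_src, n_tgt))   # true edge probabilities

counts = np.zeros((n_src, n_tgt)); sums = np.zeros((n_src, n_tgt))

def coverage(seed, q):
    # Expected number of covered targets under edge-probability matrix q.
    return (1 - np.prod(1 - q[list(seed)], axis=0)).sum()

for t in range(1, T + 1):
    ucb = np.clip(sums / np.maximum(counts, 1)
                  + np.sqrt(1.5 * np.log(t) / np.maximum(counts, 1)), 0, 1)
    # Oracle: brute-force best seed set under optimistic estimates.
    seed = max(combinations(range(n_src), k), key=lambda s: coverage(s, ucb))
    # Feedback: only edges out of the chosen seeds are triggered and observed.
    for i in seed:
        outcome = (rng.random(n_tgt) < p[i]).astype(float)
        sums[i] += outcome; counts[i] += 1

best = max(combinations(range(n_src), k), key=lambda s: coverage(s, p))
print("learned seed set:", seed, " true best:", best)
```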

ICLR Conference 2023 Conference Paper

Achieving Near-Optimal Individual Regret & Low Communications in Multi-Agent Bandits

  • Xuchuang Wang
  • Lin Yang 0013
  • Yu-Zhen Janice Chen
  • Xutong Liu 0002
  • Mohammad Hajiesmaili
  • Don Towsley
  • John C. S. Lui

Cooperative multi-agent multi-armed bandits (CM2AB) study how distributed agents cooperatively play the same multi-armed bandit game. Most existing CM2AB works focus on maximizing the group performance of all agents---the accumulation of all agents' individual performance (i.e., individual reward). However, in many applications, the performance of the system is more sensitive to the ``bad'' agent---the agent with the worst individual performance. For example, in a drone swarm, a ``bad'' agent may crash into other drones and severely degrade the system performance. In that case, the key to learning algorithm design is to coordinate computational and communication resources among agents so as to optimize the individual learning performance of the ``bad'' agent. In CM2AB, maximizing the group performance is equivalent to minimizing the group regret of all agents, and optimizing the worst individual performance corresponds to minimizing the maximum (worst) individual regret among agents. Minimizing the maximum individual regret was largely ignored in prior literature, and currently there is little work on how to minimize this objective with a low communication overhead. In this paper, we propose an algorithm that is near-optimal for both individual and group regrets; in addition, it features a novel communication module that needs only $O(\log(\log T))$ communication rounds, where $T$ is the number of decision rounds. We also conduct simulations to illustrate the advantage of our algorithm by comparing it to other known baselines.
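
One simple way to realize an $O(\log(\log T))$ communication count is to synchronize only at rounds of the form $2^{2^k}$, as the sketch below shows; this doubling-of-doublings schedule is an illustration of the asymptotics, and the paper's actual communication trigger may differ.

```python
import math

def comm_rounds(T: int) -> list[int]:
    """Rounds of the form 2^(2^k) up to horizon T: O(log log T) of them."""
    rounds, k = [], 0
    while 2 ** (2 ** k) <= T:
        rounds.append(2 ** (2 ** k))
        k += 1
    return rounds

T = 10**6
schedule = comm_rounds(T)
print(f"T={T}: communicate at {schedule} "
      f"({len(schedule)} times ~ log2(log2(T)) = {math.log2(math.log2(T)):.1f})")
```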

UAI Conference 2023 Conference Paper

Exploration for Free: How Does Reward Heterogeneity Improve Regret in Cooperative Multi-agent Bandits?

  • Xuchuang Wang
  • Lin Yang 0013
  • Yu-Zhen Janice Chen
  • Xutong Liu 0002
  • Mohammad Hajiesmaili
  • Don Towsley
  • John C. S. Lui

This paper studies a cooperative multi-agent bandit scenario in which the rewards observed by agents are heterogeneous—one agent’s meat can be another agent’s poison. Specifically, the total reward observed by each agent is the sum of two values: an arm-specific reward, capturing the intrinsic value of the arm, and a privately-known agent-specific reward, which captures the personal preference/limitations of the agent. This heterogeneity in total reward leads to different local optimal arms for agents but creates an opportunity for \textit{free exploration} in a cooperative setting—an agent can freely explore its local optimal arm with no regret and share this free observation with some other agents who would suffer regrets if they pull this arm since the arm is not optimal for them. We first characterize a regret lower bound that captures free exploration, i.e., arms that can be freely explored have no contribution to the regret lower bound. Then, we present a cooperative bandit algorithm that takes advantage of free exploration and achieves a near-optimal regret upper bound which tightly matches the regret lower bound up to a constant factor. Lastly, we run numerical simulations to compare our algorithm with various baselines without free exploration.
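
A tiny numeric illustration of the free-exploration mechanism (toy numbers, not the paper's algorithm): each agent's total reward is the arm value plus a private offset, so local optima differ, and one agent's zero-regret pulls inform the others.

```python
import numpy as np

base = np.array([0.5, 0.8, 0.6])          # arm-specific means (unknown)
offset = {1: np.array([0.4, 0.0, 0.0]),   # agent 1 privately prefers arm 0
          2: np.array([0.0, 0.0, 0.0])}   # agent 2 has no private preference

for agent, off in offset.items():
    total = base + off
    print(f"agent {agent}: local optimum is arm {int(np.argmax(total))}")
# Agent 1's optimum is arm 0, so it samples arm 0 at zero regret and shares
# those observations; agent 2 learns arm 0's value "for free", i.e. without
# pulling an arm that is suboptimal for it.
```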

NeurIPS Conference 2023 Conference Paper

Multi-Fidelity Multi-Armed Bandits Revisited

  • Xuchuang Wang
  • Qingyun Wu
  • Wei Chen
  • John C. S. Lui

We study the multi-fidelity multi-armed bandit ($\texttt{MF-MAB}$), an extension of the canonical multi-armed bandit (MAB) problem. $\texttt{MF-MAB}$ allows each arm to be pulled with different costs (fidelities) and observation accuracy. We study both the best arm identification with fixed confidence ($\texttt{BAI}$) and the regret minimization objectives. For $\texttt{BAI}$, we present (a) a cost complexity lower bound, (b) an algorithmic framework with two alternative fidelity selection procedures, and (c) both procedures' cost complexity upper bounds. From both cost complexity bounds of $\texttt{MF-MAB}$, one can recover the standard sample complexity bounds of the classic (single-fidelity) MAB. For regret minimization of $\texttt{MF-MAB}$, we propose a new regret definition, prove its problem-independent regret lower bound $\Omega(K^{1/3}\Lambda^{2/3})$ and problem-dependent lower bound $\Omega(K\log \Lambda)$, where $K$ is the number of arms and $\Lambda$ is the decision budget in terms of cost, and devise an elimination-based algorithm whose worst-cost regret upper bound matches its corresponding lower bound up to some logarithmic terms, and whose problem-dependent bound matches its corresponding lower bound in terms of $\Lambda$.
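
A toy sketch of the multi-fidelity trade-off follows: fidelity $f$ has a cost and a maximum observation bias, so coarse, cheap fidelities prune arms early and fine, expensive ones settle the remainder. The costs, biases, confidence constants, and sweep order are illustrative assumptions, not the paper's model or procedures.

```python
import numpy as np

rng = np.random.default_rng(6)
K = 6
mu = np.sort(rng.uniform(0, 1, K))[::-1]    # arm 0 is best (unknown)
cost = [1.0, 4.0, 16.0]                     # per-sample cost of each fidelity
zeta = [0.20, 0.05, 0.0]                    # max observation bias per fidelity

active = list(range(K))
spent = 0.0
for f in range(3):                          # sweep fidelities coarse -> fine
    eps = max(2 * zeta[f], 0.02)            # precision this fidelity certifies
    n = int(np.ceil(2 * np.log(100) / eps**2))   # log(100) ~ log(1/delta)
    obs = np.zeros(K)
    for i in active:
        bias = rng.uniform(-zeta[f], zeta[f])
        obs[i] = mu[i] + bias + 0.1 * rng.normal() / np.sqrt(n)  # mean of n samples
        spent += n * cost[f]
    best = max(obs[i] for i in active)
    active = [i for i in active if obs[i] >= best - 2 * eps]

print(f"survivors {active}, total cost {spent:.0f}")
```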

EWRL Workshop 2023 Workshop Paper

On-Demand Communication for Asynchronous Multi-Agent Bandits

  • Yu-Zhen Janice Chen
  • Lin Yang
  • Xuchuang Wang
  • Xutong Liu
  • Mohammad Hajiesmaili
  • John C. S. Lui
  • Don Towsley

This paper studies a cooperative multi-agent multi-armed stochastic bandit problem where agents operate $\textit{asynchronously}$ -- agent pull times and rates are unknown, irregular, and heterogeneous -- and face the same instance of a $K$-armed bandit problem. Agents can share reward information to speed up the learning process at additional communication costs. We propose $\texttt{ODC}$, an on-demand communication protocol that tailors the communication of each pair of agents based on their empirical pull times. $\texttt{ODC}$ is efficient when the pull times of agents are highly heterogeneous, and its communication complexity depends on the empirical pull times of agents. $\texttt{ODC}$ is a generic protocol that can be integrated into most cooperative bandit algorithms without degrading their performance. We then incorporate $\texttt{ODC}$ into the natural extensions of $\texttt{UCB}$ and $\texttt{AAE}$ algorithms and propose two communication-efficient cooperative algorithms. Our analysis shows that both algorithms are near-optimal in regret.
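
The schematic sketch below conveys the on-demand flavor: an agent refreshes its cached copy of a peer's statistics only when its own pull count has doubled since the last fetch, so message cost tracks empirical pull activity rather than the time horizon. The doubling trigger, two-agent setup, and UCB policy are assumptions for illustration, not the $\texttt{ODC}$ protocol itself.

```python
import numpy as np

rng = np.random.default_rng(7)
K, T = 4, 2000
mu = rng.uniform(0, 1, K)

sums = [np.zeros(K), np.zeros(K)]; counts = [np.zeros(K), np.zeros(K)]
peer_view = [(np.zeros(K), np.zeros(K)),
             (np.zeros(K), np.zeros(K))]   # cached snapshot of peer statistics
pulls = [0, 0]; last_sync = [1, 1]; messages = 0

for t in range(T):
    for a in (0, 1):
        if a == 1 and rng.random() > 0.1:  # asynchronous: agent 1 pulls rarely
            continue
        # UCB over combined local + cached peer statistics.
        n = counts[a] + peer_view[a][1] + 1e-9
        ucb = (sums[a] + peer_view[a][0]) / n + np.sqrt(2 * np.log(t + 2) / n)
        i = int(np.argmax(ucb))
        sums[a][i] += float(rng.random() < mu[i]); counts[a][i] += 1
        pulls[a] += 1
        if pulls[a] >= 2 * last_sync[a]:   # on-demand doubling trigger
            peer = 1 - a
            peer_view[a] = (sums[peer].copy(), counts[peer].copy())
            last_sync[a] = pulls[a]; messages += 1

print(f"pulls={pulls}, messages={messages} over {T} rounds")
```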

IJCAI Conference 2022 Conference Paper

Multi-Player Multi-Armed Bandits with Finite Shareable Resources Arms: Learning Algorithms & Applications

  • Xuchuang Wang
  • Hong Xie
  • John C. S. Lui

Multi-player multi-armed bandits (MMAB) study how decentralized players cooperatively play the same multi-armed bandit so as to maximize their total cumulative rewards. Existing MMAB models mostly assume when more than one player pulls the same arm, they either have a collision and obtain zero rewards or have no collision and gain independent rewards, both of which are usually too restrictive in practical scenarios. In this paper, we propose an MMAB with shareable resources as an extension of the collision and non-collision settings. Each shareable arm has finite shareable resources and a “per-load” reward random variable, both of which are unknown to players. The reward from a shareable arm is equal to the “per-load” reward multiplied by the minimum between the number of players pulling the arm and the arm’s maximal shareable resources. We consider two types of feedback: sharing demand information (SDI) and sharing demand awareness (SDA), each of which provides different signals of resource sharing. We design the DPE-SDI and SIC-SDA algorithms to address the shareable arm problem under these two cases of feedback respectively and prove that both algorithms have logarithmic regrets that are tight in the number of rounds. We conduct simulations to validate both algorithms’ performance and show their utilities in wireless networking and edge computing.
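
The reward model described above is easy to state in code: a shareable arm with capacity $c$ yields the "per-load" draw times $\min(\text{players}, c)$. The Gaussian per-load distribution and parameters below are toy assumptions for the demo.

```python
import numpy as np

rng = np.random.default_rng(8)

def shareable_reward(n_players: int, capacity: int, per_load_mean: float) -> float:
    per_load = rng.normal(per_load_mean, 0.1)        # unknown per-load draw
    return per_load * min(n_players, capacity)       # load-dependent reward

for n in range(1, 6):
    r = np.mean([shareable_reward(n, capacity=3, per_load_mean=0.5)
                 for _ in range(10000)])
    print(f"{n} players -> mean reward ~ {r:.2f}")   # saturates once n >= capacity
```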

ICML Conference 2022 Conference Paper

Multiple-Play Stochastic Bandits with Shareable Finite-Capacity Arms

  • Xuchuang Wang
  • Hong Xie 0004
  • John C. S. Lui

We generalize the multiple-play multi-armed bandits (MP-MAB) problem with a shareable arms setting, in which several plays can share the same arm. Furthermore, each shareable arm has a finite reward capacity and a “per-load” reward distribution, both of which are unknown to the learner. The reward from a shareable arm is load-dependent: it equals the “per-load” reward multiplied by either the number of plays pulling the arm or, when the number of plays exceeds the capacity limit, the arm’s reward capacity. When the “per-load” reward follows a Gaussian distribution, we prove a sample complexity lower bound for learning the capacity from load-dependent rewards and also a regret lower bound for this new MP-MAB problem. We devise a capacity estimator whose sample complexity upper bound matches the lower bound in terms of reward means and capacities. We also propose an online learning algorithm to address the problem and prove its regret upper bound. The first term of this regret upper bound matches that of the regret lower bound, and its second and third terms also closely correspond to those of the lower bound. Extensive experiments validate our algorithm’s performance and also its gain in 5G & 4G base station selection.
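
A toy version of the capacity-estimation idea: raise the load until the empirical mean reward stops increasing, which signals that the number of plays has exceeded the capacity. Unlike the paper's estimator, this sketch assumes the per-load mean is known (it is used as the stall threshold) and uses ad hoc sample sizes.

```python
import numpy as np

rng = np.random.default_rng(9)
true_capacity, per_load_mean, n = 4, 0.5, 4000

def mean_reward_at_load(load: int) -> float:
    # Load-dependent Gaussian rewards: per-load draw times min(load, capacity).
    draws = rng.normal(per_load_mean, 0.2, size=n) * min(load, true_capacity)
    return draws.mean()

prev, load = mean_reward_at_load(1), 1
while True:
    cur = mean_reward_at_load(load + 1)
    if cur - prev < per_load_mean / 2:     # growth stalled -> capacity reached
        break
    prev, load = cur, load + 1

print(f"estimated capacity: {load} (true: {true_capacity})")
```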