Arrow Research

Author name cluster

Shuai Li 0010

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

18 papers
1 author row

Possible papers (18)

ICLR Conference 2025 Conference Paper

Bandit Learning in Matching Markets with Indifference

  • Fang Kong 0002
  • Jingqi Tang
  • Mingzhu Li
  • Pinyan Lu
  • John C. S. Lui
  • Shuai Li 0010

A rich line of recent works studies how participants in matching markets learn their unknown preferences through iterative interactions with each other. The two sides of participants in the market can be respectively formulated as players and arms in the bandit problem. To ensure market stability, the objective is to minimize the stable regret of each player. Though existing works provide significant theoretical upper bounds for players' stable regret, the results heavily rely on the assumption that each participant has a strict preference ranking. However, in real applications, multiple candidates (e.g., workers in the labor market and students in school admission) usually demonstrate comparable performance levels, making it challenging for participants (e.g., employers and schools) to differentiate and rank their preferences. To deal with such potentially indifferent preferences, we propose an adaptive exploration algorithm based on arm-guided Gale-Shapley (AE-AGS). We show that its stable regret is of order $O(NK \log T / \Delta^2)$, where $N$ is the number of players, $K$ the number of arms, $T$ the total time horizon, and $\Delta$ the minimum non-zero preference gap. Extensive experiments demonstrate the algorithm's effectiveness in handling such complex situations and its consistent superiority over baselines.
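
For reference, the sketch below implements the plain arm-proposing (arm-guided) Gale-Shapley deferred-acceptance mechanism that AE-AGS builds on. The bandit layer, where players estimate preferences from rewards and adaptively explore near-indifferent arms, is the paper's contribution and is not reproduced; all names are illustrative.

```python
def arm_proposing_gale_shapley(arm_prefs, player_scores):
    """Deferred acceptance with arms proposing.

    arm_prefs[k]: arm k's preference list over players (best first).
    player_scores[i][k]: player i's (estimated) value for arm k; a player
    holds the best proposal received so far and rejects the rest.
    """
    K = len(arm_prefs)
    match_of_player = {}      # player -> arm currently held
    next_proposal = [0] * K   # pointer into each arm's preference list
    free_arms = list(range(K))
    while free_arms:
        k = free_arms.pop()
        if next_proposal[k] >= len(arm_prefs[k]):
            continue          # arm k has exhausted its list; stays unmatched
        i = arm_prefs[k][next_proposal[k]]
        next_proposal[k] += 1
        incumbent = match_of_player.get(i)
        if incumbent is None or player_scores[i][k] > player_scores[i][incumbent]:
            match_of_player[i] = k
            if incumbent is not None:
                free_arms.append(incumbent)  # displaced arm proposes again
        else:
            free_arms.append(k)              # rejected; propose to next player
    return match_of_player
```

In the bandit setting, player_scores would hold empirical mean rewards, so near-ties (indifference) are decided by the comparison above; the adaptive-exploration schedule that handles such ties is what the paper adds.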

ICML Conference 2025 Conference Paper

Improved Discretization Complexity Analysis of Consistency Models: Variance Exploding Forward Process and Decay Discretization Scheme

  • Ruofeng Yang
  • Bo Jiang
  • Cheng Chen 0015
  • Shuai Li 0010

Consistency models, a new class of one-step generative models, have shown competitive performance with multi-step diffusion models. The most challenging part of consistency models is the training process, which discretizes the continuous diffusion process into $K$ steps and trains a one-step mapping function on these discretized timepoints. Despite the empirical success, only a few works focus on the discretization complexity $K$, and their setting is far from that of empirical works. More specifically, the current theoretical works analyze the variance preserving (VP) diffusion process with a uniform stepsize, while empirical works adopt a variance exploding (VE) process with a decay discretization stepsize. As a result, these works suffer from large discretization complexity and fail to explain the empirical success of consistency models. To close the gap between theory and application, we analyze consistency models with (1) VE process and (2) decay stepsize and prove the state-of-the-art discretization complexity for consistency models. This result is competitive with the results of diffusion models and shows the potential of consistency models. To balance the computation and performance, previous empirical work further proposes a $2$-step consistency algorithm. In this work, we also analyze the role of $2$-step sampling and show that it improves the discretization complexity compared with one-step generation.
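
For concreteness, here is one widely used decay discretization of a variance-exploding noise scale (the $\rho$-schedule of Karras et al., 2022), which takes small steps near the data and large steps at high noise. Whether this is exactly the schedule analyzed in the paper is an assumption; the sketch only illustrates what "VE process with decay stepsize" looks like.

```python
import numpy as np

def ve_decay_schedule(K, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """K noise levels decreasing from sigma_max to sigma_min.

    Interpolating in sigma^(1/rho) space concentrates timepoints near
    sigma_min, i.e., the stepsize decays toward the data distribution.
    """
    i = np.arange(K)
    return (sigma_max ** (1 / rho)
            + i / (K - 1) * (sigma_min ** (1 / rho) - sigma_max ** (1 / rho))) ** rho
```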

ICML Conference 2025 Conference Paper

Learning Imperfect Information Extensive-form Games with Last-iterate Convergence under Bandit Feedback

  • Canzhe Zhao
  • Yutian Cheng
  • Jing Dong 0008
  • Baoxiang Wang 0001
  • Shuai Li 0010

We investigate learning approximate Nash equilibrium (NE) policy profiles in two-player zero-sum imperfect information extensive-form games (IIEFGs) with last-iterate convergence guarantees. Existing algorithms either rely on full-information feedback or provide only asymptotic convergence rates. In contrast, we focus on the bandit feedback setting, where players receive feedback solely from the rewards associated with the experienced information set-action pairs in each episode. Our proposed algorithm employs a negentropy regularizer weighted by a "virtual transition" over the information set-action space to facilitate an efficient approximate policy update. Through a carefully designed virtual transition and leveraging the entropy regularization technique, we demonstrate finite-time last-iterate convergence to the NE at a rate of $\widetilde{\mathcal{O}}(k^{-\frac{1}{8}})$ under bandit feedback, where $k$ is the number of episodes. Empirical evaluations across various IIEFG instances show its competitive performance compared to baseline methods.
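
Underneath such algorithms sits negentropy OMD (exponential weights) driven by an importance-weighted loss estimator, which keeps bandit feedback unbiased. A minimal sketch of that base step over a single simplex; the paper's virtual-transition weighting over the information set-action space is not reproduced here.

```python
import numpy as np

def bandit_exp_weights_step(x, played, observed_loss, eta):
    """One exponential-weights (negentropy OMD) step under bandit feedback.

    Only the played action's loss is observed, so the importance-weighted
    estimator l_hat[a] = loss * 1{a == played} / x[a] is unbiased. Explicit
    exploration mixing, as in EXP3-style methods, is omitted.
    """
    loss_est = np.zeros_like(x)
    loss_est[played] = observed_loss / x[played]
    z = x * np.exp(-eta * loss_est)
    return z / z.sum()
```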

ICML Conference 2025 Conference Paper

Offline-to-Online Reinforcement Learning with Classifier-Free Diffusion Generation

  • Xiao Huang
  • Xu Liu
  • Enze Zhang
  • Tong Yu 0001
  • Shuai Li 0010

Offline-to-online Reinforcement Learning (O2O RL) aims to perform online fine-tuning on an offline pre-trained policy to minimize costly online interactions. Existing work uses offline datasets to generate data that conform to the online data distribution for data augmentation. However, the generated data still exhibits a gap from the online data, limiting overall performance. To address this, we propose a new data augmentation approach, Classifier-Free Diffusion Generation (CFDG). Without introducing additional classifier training overhead, CFDG leverages classifier-free guidance diffusion to significantly enhance the generation quality of offline and online data with different distributions. Additionally, it employs a reweighting method so that more of the generated data aligns with the online data, enhancing performance while maintaining the agent's stability. Experimental results show that CFDG outperforms replaying the two data types or using a standard diffusion model to generate new data. Our method is versatile and can be integrated with existing offline-to-online RL algorithms. Applying CFDG to the popular methods IQL, PEX, and APL yields a notable 15% average improvement on D4RL benchmark domains such as MuJoCo and AntMaze.
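
The classifier-free guidance CFDG builds on combines conditional and unconditional predictions from one jointly trained network, with no separate classifier. A minimal sketch of the standard combination step, assuming an illustrative model interface that accepts None as the null condition; how CFDG encodes the offline/online distinction as a condition is specific to the paper and not shown.

```python
import torch

def cfg_predict(model, x_t, t, cond, w):
    """Classifier-free guidance: eps = eps_uncond + w * (eps_cond - eps_uncond).

    w = 0 recovers the unconditional model and larger w pushes samples
    toward the conditional distribution; `model` and its signature are
    illustrative stand-ins.
    """
    eps_cond = model(x_t, t, cond)
    eps_uncond = model(x_t, t, None)
    return eps_uncond + w * (eps_cond - eps_uncond)
```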

ICML Conference 2025 Conference Paper

Optimal Algorithm for Max-Min Fair Bandit

  • Zilong Wang 0010
  • Zhiyao Zhang
  • Shuai Li 0010

Multi-player multi-armed bandit (MP-MAB) has been widely studied owing to its diverse applications across numerous domains. We consider an MP-MAB problem where $N$ players compete for $K$ arms in $T$ rounds. The reward distributions are heterogeneous, where each player has a different expected reward for the same arm. When multiple players select the same arm, they collide and obtain zero rewards. In this paper, our goal is to find the max-min fair matching that maximizes the reward of the player who receives the lowest reward. This paper improves the existing max-min regret upper bound of $O(\exp(1/\Delta) + K^3 \log T\log \log T)$. More specifically, our decentralized fair elimination algorithm (DFE) deals with heterogeneity and collisions carefully and attains a regret upper bound of $O((N^2+K)\log T / \Delta)$, where $\Delta$ is the minimum reward gap between the max-min value and sub-optimal arms. In addition, this paper provides an $\Omega(\max\{N^2, K\} \log T / \Delta)$ regret lower bound for this problem, which indicates that our algorithm is optimal with respect to the key parameters $T, N, K$, and $\Delta$. Numerical experiments further demonstrate the efficiency and improvement of our algorithm.
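
To make the objective concrete, the brute-force sketch below computes the max-min fair matching that the regret is measured against. DFE never sees the true means; it replaces them with confidence bounds and eliminates sub-optimal arms in decentralized phases, all of which this illustration omits.

```python
import itertools
import numpy as np

def max_min_matching(mu):
    """Exhaustive max-min fair matching (illustration; exponential in N).

    mu[i, k] is player i's expected reward on arm k (requires K >= N).
    Returns the assignment maximizing the worst-off player's reward.
    """
    N, K = mu.shape
    best_val, best_match = -np.inf, None
    for perm in itertools.permutations(range(K), N):
        val = min(mu[i, perm[i]] for i in range(N))
        if val > best_val:
            best_val, best_match = val, perm
    return best_val, best_match
```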

UAI Conference 2025 Conference Paper

Towards Provably Efficient Learning of Imperfect Information Extensive-Form Games with Linear Function Approximation

  • Canzhe Zhao
  • Shuze Chen
  • Weiming Liu 0004
  • Haobo Fu
  • Qiang Fu 0016
  • Shuai Li 0010

Despite significant advances in learning imperfect information extensive-form games (IIEFGs), most existing theoretical guarantees are limited to IIEFGs in the tabular case. To permit efficient learning of large-scale IIEFGs, we take the first step in studying two-player zero-sum IIEFGs with linear function approximation. In particular, we consider linear IIEFGs in the formulation of partially observable Markov games (POMGs) with linearly parameterized rewards. To address the challenge that the underlying function approximation structure is difficult to exploit directly under the imperfect information of states, we construct composite "feature vectors" for information set-action pairs. Based on this, we further propose a "least-squares loss estimator", which we call the *fictitious* least-squares loss estimator. By integrating this estimator with the follow-the-regularized-leader (FTRL) framework, we propose the *fictitious* least-squares follow-the-regularized-leader ($\text{F}^2\text{TRL}$) algorithm, which achieves a provable $\widetilde{\mathcal{O}}(\lambda\sqrt{d H^2 T})$ regret guarantee in the large $T$ regime, where $d$ is the ambient dimension of the feature mapping, $H$ is the horizon length, $\lambda$ is a "balance coefficient" and $T$ is the number of episodes. At the core of the analysis of $\text{F}^2\text{TRL}$ is our proposed new "balanced transition" over the information set-action space. Additionally, we complement our results with an $\Omega(\sqrt{d\min(d, H)T})$ regret lower bound for this problem and conduct empirical evaluations across various environments, which corroborate the effectiveness of our algorithm.
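
For intuition, the standard least-squares loss estimator for linearly parameterized losses is shown below; when Sigma equals the expected feature covariance under the played policy, the estimate is unbiased for the loss parameter. The *fictitious* variant, the composite feature vectors, and the balance coefficient are constructions specific to the paper and are not reproduced.

```python
import numpy as np

def ls_loss_estimate(Sigma, phi, observed_loss):
    """Least-squares loss estimator for a linear loss <phi, theta>.

    With Sigma = E[phi phi^T] under the played policy,
    E[Sigma^{-1} phi phi^T theta] = theta, so the estimator is unbiased.
    """
    return np.linalg.solve(Sigma, phi) * observed_loss
```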

ICML Conference 2024 Conference Paper

Combinatorial Multivariant Multi-Armed Bandits with Applications to Episodic Reinforcement Learning and Beyond

  • Xutong Liu 0002
  • Siwei Wang 0002
  • Jinhang Zuo
  • Han Zhong
  • Xuchuang Wang
  • Zhiyong Wang
  • Shuai Li 0010
  • Mohammad Hajiesmaili

We introduce a novel framework of combinatorial multi-armed bandits (CMAB) with multivariant and probabilistically triggering arms (CMAB-MT), where the outcome of each arm is a $d$-dimensional multivariant random variable and the feedback follows a general arm triggering process. Compared with existing CMAB works, CMAB-MT not only enhances the modeling power but also allows improved results by leveraging distinct statistical properties for multivariant random variables. For CMAB-MT, we propose a general 1-norm multivariant and triggering probability-modulated smoothness condition, and an optimistic CUCB-MT algorithm built upon this condition. Our framework can include many important problems as applications, such as episodic reinforcement learning (RL) and probabilistic maximum coverage for goods distribution, all of which meet the above smoothness condition and achieve matching or improved regret bounds compared to existing works. Through our new framework, we build the first connection between the episodic RL and CMAB literature, by offering a new angle to solve the episodic RL through the lens of CMAB, which may encourage more interactions between these two important directions.
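
The CUCB skeleton that CUCB-MT extends is short: form an optimistic estimate per base arm, then hand it to an offline oracle that returns a feasible super arm. A minimal sketch under the usual [0, 1] reward assumption; the d-dimensional multivariant outcomes and triggering-probability modulation of CMAB-MT are not reproduced, and `oracle` is a stand-in for a problem-specific solver.

```python
import numpy as np

def cucb_round(counts, mean_est, t, oracle):
    """One CUCB-style round: optimism per base arm plus an offline oracle.

    counts[k] is how often base arm k has been observed; arms with no
    observations would normally be force-explored first.
    """
    bonus = np.sqrt(3.0 * np.log(t) / (2.0 * np.maximum(counts, 1)))
    ucb = np.minimum(mean_est + bonus, 1.0)  # rewards assumed in [0, 1]
    return oracle(ucb)                       # oracle picks a feasible super arm
```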

ICML Conference 2024 Conference Paper

In-context Learning on Function Classes Unveiled for Transformers

  • Zhijie Wang
  • Bo Jiang
  • Shuai Li 0010

Transformer-based neural sequence models exhibit a remarkable ability to perform in-context learning. Given some training examples, a pre-trained model can make accurate predictions on an unseen input. This paper studies why transformers can learn different types of function classes in-context. We first show by construction that there exists a family of transformers (with different activation functions) that implement approximate gradient descent on the parameters of neural networks, and we provide an upper bound for the number of heads, hidden dimensions, and layers of the transformer. We also show that a transformer can learn linear functions, the indicator function of a unit ball, and smooth functions in-context by learning neural networks that approximate them. The above instances mainly concern a transformer pre-trained on a single task. We also prove that when pre-trained on two tasks, linear regression and classification, a transformer can make accurate predictions on both tasks simultaneously. Our results move beyond linearity in terms of in-context learning instances and provide a comprehensive understanding of why transformers can learn many types of function classes through the bridge of neural networks.
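
One way to read the constructive result: a transformer layer can be wired so that its output at the query position equals the prediction after a gradient-descent step on the in-context examples. The sketch below computes that target quantity directly for linear regression; the attention-weight construction itself, and its extension to neural-network learners, is what the paper supplies.

```python
import numpy as np

def one_gd_step_prediction(X, y, x_query, lr):
    """Prediction after one GD step on in-context examples (X, y).

    Starting from w = 0, one step on the loss (1/2n) * ||Xw - y||^2 gives
    w = lr * X^T y / n; a suitably constructed attention layer can output
    exactly x_query @ w.
    """
    n = len(y)
    w = lr * X.T @ y / n
    return x_query @ w
```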

ICLR Conference 2024 Conference Paper

On Stationary Point Convergence of PPO-Clip

  • Ruinan Jin
  • Shuai Li 0010
  • Baoxiang Wang 0001

Proximal policy optimization (PPO) has gained popularity in reinforcement learning (RL). Its PPO-Clip variant is one of the most frequently implemented algorithms and is often among the first to try in RL tasks. This variant uses a clipped surrogate objective function not typically found in other algorithms. Many works have demonstrated the practical performance of PPO-Clip, but the theoretical understanding of it is limited to specific settings. In this work, we provide a comprehensive analysis that shows the stationary point convergence of PPO-Clip and the convergence rate thereof. Our analysis is new and overcomes many challenges, including the non-smooth nature of the clip operator, the potentially unbounded score function, and the involvement of the ratio of two stochastic policies. Our results and techniques may offer new insights into PPO-Clip.
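
The clipped surrogate at the center of the analysis is compact; a minimal PyTorch rendering (negated, so that a gradient-descent optimizer maximizes the surrogate):

```python
import torch

def ppo_clip_loss(log_prob_new, log_prob_old, advantage, eps=0.2):
    """PPO-Clip surrogate: -E[min(r * A, clip(r, 1 - eps, 1 + eps) * A)],
    where r is the ratio of new to old policy probabilities. The min makes
    the objective pessimistic, and the clip bounds the update size, which
    is exactly the non-smoothness the analysis has to handle.
    """
    ratio = torch.exp(log_prob_new - log_prob_old)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantage
    return -torch.min(unclipped, clipped).mean()
```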

ICML Conference 2023 Conference Paper

Future-conditioned Unsupervised Pretraining for Decision Transformer

  • Zhihui Xie 0002
  • Zichuan Lin
  • Deheng Ye
  • Qiang Fu 0016
  • Yang Wei
  • Shuai Li 0010

Recent research in offline reinforcement learning (RL) has demonstrated that return-conditioned supervised learning is a powerful paradigm for decision-making problems. While promising, return conditioning is limited to training data labeled with rewards and therefore faces challenges in learning from unsupervised data. In this work, we aim to utilize generalized future conditioning to enable efficient unsupervised pretraining from reward-free and sub-optimal offline data. We propose Pretrained Decision Transformer (PDT), a conceptually simple approach for unsupervised RL pretraining. PDT leverages future trajectory information as a privileged context to predict actions during training. The ability to make decisions based on both present and future factors enhances PDT's capability for generalization. Moreover, this feature can be easily incorporated into a return-conditioned framework for online finetuning, by assigning return values to possible futures and sampling future embeddings based on their respective values. Empirically, PDT outperforms or performs on par with its supervised pretraining counterpart, especially when dealing with sub-optimal data. Further analysis reveals that PDT can extract diverse behaviors from offline data and controllably sample high-return behaviors by online finetuning. Code is available online.

ICLR Conference 2023 Conference Paper

Learning Adversarial Linear Mixture Markov Decision Processes with Bandit Feedback and Unknown Transition

  • Canzhe Zhao
  • Ruofeng Yang
  • Baoxiang Wang 0001
  • Shuai Li 0010

We study reinforcement learning (RL) with linear function approximation, unknown transition, and adversarial losses in the bandit feedback setting. Specifically, the unknown transition probability function is a linear mixture model \citep{AyoubJSWY20,ZhouGS21,HeZG22} with a given feature mapping, and the learner only observes the losses of the experienced state-action pairs instead of the whole loss function. We propose an efficient algorithm LSUOB-REPS which achieves $\widetilde{O}(dS^2\sqrt{K}+\sqrt{HSAK})$ regret guarantee with high probability, where $d$ is the ambient dimension of the feature mapping, $S$ is the size of the state space, $A$ is the size of the action space, $H$ is the episode length and $K$ is the number of episodes. Furthermore, we also prove a lower bound of order $\Omega(dH\sqrt{K}+\sqrt{HSAK})$ for this setting. To the best of our knowledge, we make the first step to establish a provably efficient algorithm with a sublinear regret guarantee in this challenging setting and solve the open problem of \citet{HeZG22}.

ICLR Conference 2023 Conference Paper

Stochastic No-regret Learning for General Games with Variance Reduction

  • Yichi Zhou
  • Fang Kong 0002
  • Shuai Li 0010

We show that a stochastic version of optimistic mirror descent (OMD), a variant of mirror descent with recency bias, converges fast in general games. More specifically, with our algorithm, the individual regret of each player vanishes at a speed of $O(1/T^{3/4})$ and the sum of all players' regret vanishes at a speed of $O(1/T)$, which is an improvement upon the $O(1/\sqrt{T})$ convergence rate of prior stochastic algorithms, where $T$ is the number of interaction rounds. Owing to the computational advantage of stochastic methods, we significantly improve on the time complexity of deterministic algorithms for approximating coarse correlated equilibria. To achieve lower time complexity, we equip the stochastic version of OMD in \cite{alacaoglu2021stochastic} with a novel low-variance Monte-Carlo estimator. Our algorithm extends previous works \cite{alacaoglu2021stochastic,carmon2019variance} from two-player zero-sum games to general games.
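
Optimistic multiplicative weights (negentropy OMD with recency bias) keeps a secondary sequence updated on observed losses and plays a primary iterate that additionally discounts a prediction of the next loss; recency bias means using the most recent loss as that prediction. A deterministic one-step sketch; the paper's low-variance Monte-Carlo estimator, which replaces the exact loss vector, is not reproduced.

```python
import numpy as np

def optimistic_mw_step(secondary, last_loss, eta):
    """One optimistic multiplicative-weights step.

    secondary: the bookkeeping iterate, updated with the observed loss.
    The played (primary) iterate applies the loss once more as the
    optimistic prediction m_{t+1} = l_t.
    """
    secondary = secondary * np.exp(-eta * last_loss)
    secondary = secondary / secondary.sum()
    primary = secondary * np.exp(-eta * last_loss)
    return primary / primary.sum(), secondary
```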

UAI Conference 2022 Conference Paper

Federated online clustering of bandits

  • Xutong Liu 0002
  • Haoru Zhao
  • Tong Yu 0001
  • Shuai Li 0010
  • John C. S. Lui

Contextual multi-armed bandit (MAB) is an important sequential decision-making problem in recommendation systems. A line of works, called clustering of bandits (CLUB), utilizes the collaborative effect over users and dramatically improves recommendation quality. Owing to the increasing application scale and public concerns about privacy, there is a growing demand to keep user data decentralized and push bandit learning to the local server side. Existing CLUB algorithms, however, are designed under the centralized setting where data are available at a central server. We focus on studying the federated online clustering of bandits (FCLUB) problem, which aims to minimize the total regret while satisfying privacy and communication considerations. We design a new phase-based scheme for cluster detection and a novel asynchronous communication protocol for cooperative bandit learning for this problem. To protect users' privacy, previous differential privacy (DP) definitions are not well suited, so we propose a new DP notion that acts at the user-cluster level. We provide rigorous proofs to show that our algorithm simultaneously achieves (clustered) DP, sublinear communication complexity and sublinear regret. Finally, experimental evaluations show our superior performance compared with benchmark algorithms.

ICML Conference 2022 Conference Paper

Simultaneously Learning Stochastic and Adversarial Bandits with General Graph Feedback

  • Fang Kong 0002
  • Yichi Zhou
  • Shuai Li 0010

The problem of online learning with graph feedback has been extensively studied in the literature due to its generality and potential to model various learning tasks. Existing works mainly study the adversarial and stochastic feedback settings separately. If prior knowledge of the feedback mechanism is unavailable or wrong, such specially designed algorithms can suffer great losses. To avoid this problem, \citet{erez2021towards} try to optimize for both environments. However, they assume the feedback graphs are undirected and that each vertex has a self-loop, which compromises the generality of the framework and may not be satisfied in applications. With a general feedback graph, the observation of an arm may not be available when this arm is pulled, which makes exploration more expensive and makes it more challenging for algorithms to perform optimally in both environments. In this work, we overcome this difficulty with a new trade-off mechanism built on a carefully designed proportion between exploration and exploitation. We prove that the proposed algorithm simultaneously achieves $\mathrm{poly} \log T$ regret in the stochastic setting and minimax-optimal regret of $\tilde{O}(T^{2/3})$ in the adversarial setting, where $T$ is the horizon and $\tilde{O}$ hides parameters independent of $T$ as well as logarithmic terms. To our knowledge, this is the first best-of-both-worlds result for general feedback graphs.

ICLR Conference 2020 Conference Paper

The Gambler's Problem and Beyond

  • Baoxiang Wang 0001
  • Shuai Li 0010
  • Jiajin Li
  • Siu On Chan

We analyze the Gambler's problem, a simple reinforcement learning problem where the gambler has the chance to double or lose their bets until the target is reached. This is an early example introduced in the reinforcement learning textbook by Sutton and Barto (2018), where they mention an interesting pattern of the optimal value function with high-frequency components and repeating non-smooth points, but leave it without further investigation. We provide the exact formula for the optimal value function for both the discrete and the continuous cases. Simple as it might seem, the value function is pathological: fractal, self-similar, with derivative either zero or infinite, not smooth on any interval, and not expressible in elementary functions. It is in fact one of the generalized Cantor functions, holding a complexity that has been uncharted thus far. Our analyses could lend insights into improving value function approximation, gradient-based algorithms, and Q-learning, in real applications and implementations.
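
The discrete case is Example 4.3 in Sutton and Barto, and plain value iteration reproduces the pathological optimal value function the paper characterizes exactly. A self-contained sketch with the textbook's head probability of 0.4:

```python
import numpy as np

def gambler_value_iteration(goal=100, p_head=0.4, tol=1e-10):
    """Value iteration for the Gambler's problem.

    State s is the current capital; a stake a in 1..min(s, goal - s) wins
    with probability p_head. Reward is 1 only upon reaching the goal, so
    V(s) = max_a [p * V(s + a) + (1 - p) * V(s - a)].
    """
    V = np.zeros(goal + 1)
    V[goal] = 1.0
    while True:
        delta = 0.0
        for s in range(1, goal):
            best = max(p_head * V[s + a] + (1 - p_head) * V[s - a]
                       for a in range(1, min(s, goal - s) + 1))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V
```

Plotting the returned V exposes the self-similar, non-smooth structure the abstract describes.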

ICML Conference 2019 Conference Paper

Online Learning to Rank with Features

  • Shuai Li 0010
  • Tor Lattimore
  • Csaba Szepesvári

We introduce a new model for online ranking in which the click probability factors into an examination and attractiveness function and the attractiveness function is a linear function of a feature vector and an unknown parameter. Only relatively mild assumptions are made on the examination function. A novel algorithm for this setup is analysed, showing that the dependence on the number of items is replaced by a dependence on the dimension, allowing the new algorithm to handle a large number of items. When reduced to the orthogonal case, the regret of the algorithm improves on the state-of-the-art.
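
The factored model in one line, with illustrative names: chi(k) is the examination probability of position k (only mildly constrained), and attractiveness is linear in the item's features.

```python
import numpy as np

def click_probability(chi_k, x_item, theta):
    """Factored click model: P(click) = chi(position) * <x_item, theta>,
    with the attractiveness term linear in the feature vector."""
    return chi_k * float(x_item @ theta)
```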

ICML Conference 2016 Conference Paper

Contextual Combinatorial Cascading Bandits

  • Shuai Li 0010
  • Baoxiang Wang 0001
  • Shengyu Zhang 0002
  • Wei Chen 0013

We propose contextual combinatorial cascading bandits, a combinatorial online learning game, where at each time step a learning agent is given a set of contextual information, then selects a list of items, and observes stochastic outcomes of a prefix of the selected items, determined by some stopping criterion. In online recommendation, the stopping criterion might be the first item a user selects; in network routing, the stopping criterion might be the first edge blocked in a path. We consider position discounts in the list order, so that the agent's reward is discounted depending on the position where the stopping criterion is met. We design a UCB-type algorithm, $C^3$-UCB, for this problem, prove an $n$-step regret bound $\tilde{O}(\sqrt{n})$ in the general setting, and give a finer analysis for two special cases. Our work generalizes existing studies in several directions, including contextual information, position discounts, and a more general cascading bandit model. Experiments on synthetic and real datasets demonstrate the advantage of involving contextual information and position discounts.
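
A minimal LinUCB-style selection step in the spirit of $C^3$-UCB, with illustrative names: score items optimistically from ridge-regression statistics (A, b) and recommend the top positions. The update on the observed prefix, with position discounts $\gamma^i$, and the exact confidence radius follow the paper and are not shown.

```python
import numpy as np

def c3ucb_select(A, b, features, ell, alpha):
    """Optimistic ranking step of a contextual cascading bandit.

    A, b: ridge-regression statistics; features[k] is item k's context.
    UCB = <x, theta_hat> + alpha * ||x||_{A^{-1}}, capped at 1 since the
    scores model click probabilities; the top ell items are recommended.
    """
    A_inv = np.linalg.inv(A)
    theta = A_inv @ b
    radius = np.sqrt(np.einsum('ki,ij,kj->k', features, A_inv, features))
    ucb = np.minimum(features @ theta + alpha * radius, 1.0)
    return np.argsort(-ucb)[:ell]
```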