Arrow Research search

Author name cluster

Yu Bai

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

28 papers
2 author rows

Possible papers

28

AAAI Conference 2026 Conference Paper

Identifying and Analyzing Performance-Critical Tokens in Large Language Models

  • Yu Bai
  • Heyan Huang
  • Cesare Spinoso-Di Piano
  • Sanxing Chen
  • Marc-Antoine Rondeau
  • Yang Gao
  • Jackie Chi Kit Cheung

In-context learning (ICL) has emerged as an effective solution for few-shot learning with large language models (LLMs). However, how LLMs leverage demonstrations to specify a task and learn a corresponding computational function through ICL is underexplored. Drawing from the way humans learn from content-label mappings in demonstrations, we categorize the tokens in an ICL prompt into content, stopword, and template tokens. Our goal is to identify the types of tokens whose representations directly influence LLM's performance, a property we refer to as being performance-critical. By ablating representations from the attention of the test example, we find that the representations of informative content tokens have less influence on performance compared to template and stopword tokens, which contrasts with the human attention to informative words. We give evidence that the representations of performance-critical tokens aggregate information from the content tokens. Moreover, we demonstrate experimentally that lexical meaning, repetition, and structural cues are the main distinguishing characteristics of these tokens. Our work sheds light on how LLMs learn to perform tasks from demonstrations and deepens our understanding of the roles different types of tokens play in LLMs.
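
The ablation described above can be illustrated with a toy sketch (our own illustration, not the paper's code; the token labels and attention scores are invented): remove one class of tokens from the test example's attention by masking its logits, and compare the resulting attention distribution to the unablated one.

```python
import numpy as np

np.random.seed(0)

# A miniature ICL prompt, with each token tagged by the paper's three classes.
tokens = ["great", "movie", ":", "positive", "\n", "boring", "plot", ":", "negative"]
token_type = ["content", "content", "template", "label", "stopword",
              "content", "content", "template", "label"]

n = len(tokens)
scores = np.random.randn(n)  # pre-softmax attention logits from the test query

def attention_with_ablation(scores, token_type, ablate_type=None):
    """Softmax attention from the test example, optionally masking one token class."""
    logits = scores.copy()
    if ablate_type is not None:
        mask = np.array([t == ablate_type for t in token_type])
        logits[mask] = -np.inf  # these representations can no longer be attended to
    w = np.exp(logits - logits[np.isfinite(logits)].max())
    return w / w.sum()

full = attention_with_ablation(scores, token_type)
no_template = attention_with_ablation(scores, token_type, "template")
# The mass previously on template tokens is redistributed over the rest;
# the paper measures how much task performance drops under such ablations.
print(no_template.round(3))
```

In the paper's finding, ablating template and stopword tokens in this way hurts performance more than ablating content tokens.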

NeurIPS Conference 2025 Conference Paper

Accelerated Vertical Federated Adversarial Learning through Decoupling Layer-Wise Dependencies

  • Tianxing Man
  • Yu Bai
  • Ganyu Wang
  • Jinjie Fang
  • Haoran Fang
  • Bin Gu
  • Yi Chang

Vertical Federated Learning (VFL) enables participants to collaboratively train models on aligned samples while keeping their heterogeneous features private and distributed. Despite their utility, VFL models remain vulnerable to adversarial attacks during inference. Adversarial Training (AT), which generates adversarial examples at each training iteration, stands as the most effective defense for improving model robustness. However, applying AT in VFL settings (VFAL) faces significant computational efficiency challenges, as the distributed training framework necessitates iterative propagations across participants. To this end, we propose the **_DecVFAL_** framework, which substantially accelerates **_VFAL_** training through a dual-level ***Dec***oupling mechanism applied during adversarial sample generation. Specifically, we first decouple the bottom modules of clients (directly responsible for adversarial updates) from the remaining networks, enabling efficient _lazy sequential propagations_ that reduce communication frequency through delayed gradients. We further introduce _decoupled parallel backpropagation_ to accelerate delayed gradient computation by eliminating idle waiting through parallel processing across modules. Additionally, we are the first to establish convergence analysis for VFAL, rigorously characterizing how our decoupling mechanism interacts with existing VFL dynamics, and prove that _DecVFAL_ achieves an $\mathcal{O}(1/\sqrt{K})$ convergence rate matching that of standard VFLs. Experimental results show that _DecVFAL_ ensures competitive robustness while achieving a roughly $3\sim10\times$ speedup.

AAAI Conference 2025 Conference Paper

Excluding the Impossible for Open Vocabulary Semantic Segmentation

  • Shiyuan Zhao
  • Baodi Liu
  • Yu Bai
  • Weifeng Liu
  • Shuai Shao

Open vocabulary semantic segmentation is a hot topic in research, focusing on segmenting and recognizing a diverse array of categories in varied environments, including those previously unknown, thereby holding significant practical value. Mainstream studies utilize the CLIP model for direct semantic segmentation (denoted as “forward methods”), which often struggles to represent underrepresented categories effectively. To address this issue, this paper introduces a novel approach, the Excluding the ImpossibLe Semantic Segmentation Network (ELSE-Net), based on reverse thinking. By excluding improbable categories, ELSE-Net narrows the selection range for forward methods, significantly reducing the risk of misclassification. In implementation, we initially draw on leading research to design the General Processing Block (GP-Block), which generates inclusion probabilities (the likelihood of belonging to a category) by using the CLIP model in cooperation with a Mask Proposal Network (MPN). We then present the EXcluding the ImPossible Block (EXP-Block), which computes exclusion probabilities (the likelihood of not belonging to a category) through the CLIPN model and a custom-designed Reverse Retrieval Adapter (R2-Adapter). These exclusion probabilities are subsequently used to refine the inclusion probabilities, which are ultimately employed to annotate class-agnostic masks. Moreover, the core component of our EXP-Block is model-agnostic, enabling it to enhance the capabilities of existing frameworks. Experimental results from four benchmark datasets validate the effectiveness of ELSE-Net and underscore the seamless model-agnostic functionality of the EXP-Block.

AAMAS Conference 2025 Conference Paper

Self-Interpretable Reinforcement Learning via Rule Ensembles

  • Yue Yang
  • Fan Yang
  • Yu Bai
  • Hao Wang

Current reinforcement learning (RL) models, often functioning as complex ‘black boxes,’ obscure decision-making processes. This lack of transparency limits their applicability in critical real-world applications where clear reasoning behind algorithmic choices is crucial. To tackle this issue, we suggest moving from neural network or tabular approaches to a rule ensemble model, which improves decision-making clarity and adapts dynamically to environmental interactions. Specifically, our method constructs additive rule ensembles to approximate the Q-value in reinforcement learning using orthogonal gradient boosting (OGB) combined with a post-processing rule replacement technique. This method enables the model to provide inherent explanations through the use of rules. Our study sets a theoretical foundation for rule ensembles within the reinforcement learning framework, emphasizing their capacity to boost interpretability and facilitate the analysis of rule impacts. Experimental results from seven classic environments demonstrate that our proposed rule ensembles match or exceed the performance of representative RL models such as DQN, A2C, and PPO, while also providing self-interpretability and transparency.
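
What makes an additive rule ensemble self-interpretable is easy to see in a minimal sketch (the rules, weights, and CartPole-like state below are invented for illustration, not the paper's learned ensemble): Q(s, a) is simply the sum of the weights of the human-readable rules that fire, so every action choice can be traced back to the rules that produced it.

```python
# State for a toy CartPole-like task: (position, velocity, angle, angular_velocity).
# Each rule is a readable condition on the state paired with a weight.
RULES = {
    "left": [
        (lambda s: s[2] < -0.05, +1.2),  # pole leaning left -> pushing left helps
        (lambda s: s[3] < -0.5,  +0.7),  # pole falling left
        (lambda s: s[0] >  1.0,  -0.4),  # cart already far right
    ],
    "right": [
        (lambda s: s[2] >  0.05, +1.2),  # pole leaning right -> pushing right helps
        (lambda s: s[3] >  0.5,  +0.7),  # pole falling right
        (lambda s: s[0] < -1.0,  -0.4),  # cart already far left
    ],
}

def q_value(state, action):
    """Additive rule ensemble: sum the weights of all rules whose condition fires."""
    return sum(w for cond, w in RULES[action] if cond(state))

def greedy_action(state):
    return max(RULES, key=lambda a: q_value(state, a))

s = (0.0, 0.1, 0.08, 0.6)  # pole tilting right and falling further right
print(greedy_action(s))    # the firing rules themselves explain the choice
```

The paper's contribution is learning such ensembles (via orthogonal gradient boosting with rule replacement) rather than hand-writing them, but the resulting decision procedure has this transparent additive form.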

AAAI Conference 2025 Conference Paper

Text2Data: Low-Resource Data Generation with Textual Control

  • Shiyu Wang
  • Yihao Feng
  • Tian Lan
  • Ning Yu
  • Yu Bai
  • Ran Xu
  • Huan Wang
  • Caiming Xiong

Natural language serves as a common and straightforward control signal for humans to interact seamlessly with machines. Recognizing the importance of this interface, the machine learning community is investing considerable effort in generating data that is semantically coherent with textual instructions. While strides have been made in text-to-data generation spanning image editing, audio synthesis, video creation, and beyond, low-resource areas characterized by expensive annotations or complex data structures, such as molecules, motion dynamics, and time series, often lack textual labels. This deficiency impedes supervised learning, thereby constraining the application of advanced generative models for text-to-data tasks. In response to these challenges in the low-resource scenario, we propose Text2Data, a novel approach that utilizes unlabeled data to understand the underlying data distribution through an unsupervised diffusion model. Subsequently, it undergoes controllable finetuning via a novel constraint optimization-based learning objective that ensures controllability and effectively counteracts catastrophic forgetting. Comprehensive experiments demonstrate that Text2Data is able to achieve enhanced performance regarding controllability across various modalities, including molecules, motions and time series, when compared to existing baselines.

NeurIPS Conference 2025 Conference Paper

Transcending Cost-Quality Tradeoff in Agent Serving via Session-Awareness

  • Yanyu Ren
  • Li Chen
  • Dan Li
  • Xizheng Wang
  • Zhiyuan Wu
  • Yukai Miao
  • Yu Bai

Large Language Model (LLM) agents are capable of task execution across various domains by autonomously interacting with environments and refining LLM responses based on feedback. However, existing model serving systems are not optimized for the unique demands of serving agents. Compared to classic model serving, agent serving has different characteristics: predictable request pattern, increasing quality requirement, and unique prompt formatting. We identify a key problem for agent serving: LLM serving systems lack session-awareness. They neither perform effective KV cache management nor precisely select the cheapest yet competent model in each round. This leads to a cost-quality tradeoff, and we identify an opportunity to surpass it in an agent serving system. To this end, we introduce AgServe for AGile AGent SERVing. AgServe features a session-aware server that boosts KV cache reuse via Estimated-Time-of-Arrival-based eviction and in-place positional embedding calibration, a quality-aware client that performs session-aware model cascading through real-time quality assessment, and a dynamic resource scheduler that maximizes GPU utilization. With AgServe, we allow agents to select and upgrade models during the session lifetime, and to achieve similar quality at much lower costs, effectively transcending the tradeoff. Extensive experiments on real testbeds demonstrate that AgServe (1) achieves response quality comparable to GPT-4o at 16.5\% of the cost, and (2) delivers a 1.8$\times$ improvement in quality relative to the tradeoff curve.
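
The ETA-based eviction idea can be sketched in a few lines (a hypothetical interface of our own, not AgServe's actual API; the class and field names are assumptions): because agent request patterns are predictable, each session carries an estimated time of next arrival, and when the cache is full the session expected back furthest in the future is evicted first.

```python
class SessionKVCache:
    """Toy session-aware KV cache with Estimated-Time-of-Arrival eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.kv = {}   # session_id -> cached KV state (opaque blob here)
        self.eta = {}  # session_id -> estimated time of the session's next request

    def put(self, session_id, kv_state, eta_next):
        self.kv[session_id] = kv_state
        self.eta[session_id] = eta_next
        while len(self.kv) > self.capacity:
            # Evict the session whose next request is expected furthest away.
            victim = max(self.eta, key=self.eta.get)
            del self.kv[victim], self.eta[victim]

    def get(self, session_id):
        return self.kv.get(session_id)

cache = SessionKVCache(capacity=2)
cache.put("agent-A", "kv_A", eta_next=1.0)   # expected back in 1 s
cache.put("agent-B", "kv_B", eta_next=30.0)  # expected back in 30 s
cache.put("agent-C", "kv_C", eta_next=2.0)   # expected back in 2 s -> B is evicted
print(sorted(cache.kv))
```

Contrast this with LRU, which would evict by last access and could drop a session that is about to return.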

TAAS Journal 2025 Journal Article

TT-DSC: Enhancing YOLO for Marine Ecosystem through Efficient Tensor Train-based Depthwise Separable Deep Neural Network

  • Yunduan Lou
  • Pu Sun
  • Yifeng Yu
  • Shangping Ren
  • Yu Bai

The current era of Artificial Intelligence (AI) has witnessed significant and continuous advancements based on the powerful learning capabilities of Deep Neural Networks (DNNs), particularly those featuring convolutional (CONV) layers. In the field of marine ecosystem conservation, these advancements have revolutionized our ability to monitor and protect ocean environments. DNNs, especially those utilizing the YOLO (You Only Look Once) architecture, have been instrumental in tasks such as real-time marine species identification, tracking of marine mammal migrations, detection of coral bleaching events, and monitoring of illegal fishing activities. These AI-powered tools provide unprecedented insights into marine ecosystems, enabling more timely and effective conservation actions. As we aim to further enhance the computational and storage efficiency of these networks, Tensor Train (TT) decomposition has emerged as a notable compression technique due to its high compression ratio and ability to maintain strong performance. However, the CONV layer in TT format still incurs substantial computational costs, stemming from convolution calculations and the additional multiplication operations intrinsic to TT usage. Consequently, reducing these computational costs is critical to improving the effectiveness of DNNs. To advance the computational efficiency of DNNs, this paper introduces a novel separable TT decomposition that offers an efficient TT-format CONV layer using depthwise separable convolution. Remarkably, this method not only reduces computation costs significantly but also maintains a similar capacity for parameter compression and accuracy compared to the conventional TT-format model. Furthermore, our method facilitates distributed learning based on the factorization of CONV layers. By scheduling the smaller-factored weight tensors, we significantly mitigate the GPU memory requirements of the larger model, thereby enhancing the availability and speed of training.
The primary contributions of this paper are twofold: (1) a simultaneous reduction in computational cost and parameter count in TT-based CONV layers, achieved by minimizing TT redundancy and optimizing convolution, leading to up to 7–10× improvements per layer, an overall one-third reduction in parameters, and a 15% reduction in FLOPs at the model level; and (2) a demonstration of how our approach enables effective distributed learning and resource allocation. By merging TT decomposition and depthwise separable convolution, we present TT-DSC, a TT-based depthwise separable convolution approach. This study opens new avenues for improving the efficiency of CONV-layer compression and has significant implications for large-scale deep learning applications.
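
The compression mechanism underlying TT-format layers can be sketched with the standard TT-SVD procedure (our own generic sketch; the paper's contribution is a *separable* variant combined with depthwise separable convolution, which is not shown here): a layer weight is reshaped into a higher-order tensor and factored into small 3-way cores, and capping the TT ranks trades accuracy for far fewer parameters.

```python
import numpy as np

def tt_svd(T, max_rank):
    """Decompose tensor T into Tensor-Train cores via sequential truncated SVDs."""
    dims, cores, r_prev = T.shape, [], 1
    A = T.reshape(r_prev * dims[0], -1)
    for k in range(len(dims) - 1):
        U, S, Vt = np.linalg.svd(A, full_matrices=False)
        r = min(max_rank, len(S))                       # cap the TT rank
        cores.append(U[:, :r].reshape(r_prev, dims[k], r))
        A = (np.diag(S[:r]) @ Vt[:r]).reshape(r * dims[k + 1], -1)
        r_prev = r
    cores.append(A.reshape(r_prev, dims[-1], 1))
    return cores

def tt_reconstruct(cores):
    """Contract the TT cores back into a full tensor."""
    G = cores[0]
    for core in cores[1:]:
        G = np.tensordot(G, core, axes=([-1], [0]))
    return G.squeeze(axis=(0, -1))

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4, 4, 4))       # a 256-parameter weight, reshaped 4-way

exact = tt_svd(W, max_rank=256)         # ranks uncapped in effect: lossless
compressed = tt_svd(W, max_rank=2)      # rank-capped: lossy but much smaller
n_params = sum(c.size for c in compressed)
print("compressed params:", n_params, "vs full:", W.size)
```

The extra multiplications needed to apply such factored weights at inference time are exactly the computational overhead the paper's separable TT construction targets.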

AAAI Conference 2024 Conference Paper

Collaborative Consortium of Foundation Models for Open-World Few-Shot Learning

  • Shuai Shao
  • Yu Bai
  • Yan Wang
  • Baodi Liu
  • Bin Liu

Open-World Few-Shot Learning (OFSL) is a crucial research field dedicated to accurately identifying target samples in scenarios where data is limited and labels are unreliable. This research holds significant practical implications and is highly relevant to real-world applications. Recently, the advancements in foundation models like CLIP and DINO have showcased their robust representation capabilities even in resource-constrained settings with scarce data. This realization has brought about a transformative shift in focus, moving away from “building models from scratch” towards “effectively harnessing the potential of foundation models to extract pertinent prior knowledge suitable for OFSL and utilizing it sensibly”. Motivated by this perspective, we introduce the Collaborative Consortium of Foundation Models (CO3), which leverages CLIP, DINO, GPT-3, and DALL-E to collectively address the OFSL problem. CO3 comprises four key blocks: (1) the Label Correction Block (LC-Block) corrects unreliable labels, (2) the Data Augmentation Block (DA-Block) enhances available data, (3) the Feature Extraction Block (FE-Block) extracts multi-modal features, and (4) the Text-guided Fusion Adapter (TeFu-Adapter) integrates multiple features while mitigating the impact of noisy labels through semantic constraints. Only the adapter's parameters are adjustable, while the others remain frozen. Through collaboration among these foundation models, CO3 effectively unlocks their potential and unifies their capabilities to achieve state-of-the-art performance on multiple benchmark datasets. https://github.com/The-Shuai/CO3.

NeurIPS Conference 2023 Conference Paper

Efficient RL with Impaired Observability: Learning to Act with Delayed and Missing State Observations

  • Minshuo Chen
  • Yu Bai
  • H. Vincent Poor
  • Mengdi Wang

In real-world reinforcement learning (RL) systems, various forms of {\it impaired observability} can complicate matters. These situations arise when an agent is unable to observe the most recent state of the system due to latency or lossy channels, yet the agent must still make real-time decisions. This paper introduces a theoretical investigation into efficient RL in control systems where agents must act with delayed and missing state observations. We establish near-optimal regret bounds, of the form $\tilde{\mathcal{O}}(\sqrt{{\rm poly}(H) SAK})$, for RL in both the delayed and missing observation settings. Despite impaired observability posing significant challenges to the policy class and planning, our results demonstrate that learning remains efficient, with the regret bound optimally depending on the state-action size of the original system. Additionally, we provide a characterization of the performance of the optimal policy under impaired observability, comparing it to the optimal value obtained with full observability.

AAMAS Conference 2023 Conference Paper

Learning Solutions in Large Economic Networks using Deep Multi-Agent Reinforcement Learning

  • Michael Curry
  • Alexander Trott
  • Soham Phade
  • Yu Bai
  • Stephan Zheng

Real-world economies can be modeled as a network with many heterogeneous and strategic agents. In this setting, it is very challenging to find optimal mechanisms, e.g., taxes, 1) when taking strategic best responses into account and 2) even when using restrictive assumptions, e.g., that supply always meets demand. Deep multi-agent reinforcement learning (MARL) is a natural framework to learn mechanisms and model strategic best responses, but independent MARL often collapses to trivial solutions (e.g., where nobody works) as joint exploration severely distorts rewards and constraints. Here, we show how to use structured learning curricula and GPU-accelerated simulations to find non-trivial solutions in networks with many heterogeneous agents. We validate our approach in models with 100 worker-consumers, 10 firms, and a social planner who taxes and redistributes. We use empirical best-response analyses across agent types to show that it is difficult for agents to benefit by deviating from the learned solutions. In particular, we find income and corporate taxes that achieve 15% higher social welfare compared to baselines.

ICML Conference 2023 Conference Paper

Offline Learning in Markov Games with General Function Approximation

  • Yuheng Zhang
  • Yu Bai
  • Nan Jiang

We study offline multi-agent reinforcement learning (RL) in Markov games, where the goal is to learn an approximate equilibrium—such as Nash equilibrium and (Coarse) Correlated Equilibrium—from an offline dataset pre-collected from the game. Existing works consider relatively restricted tabular or linear models and handle each equilibrium separately. In this work, we provide the first framework for sample-efficient offline learning in Markov games under general function approximation, handling all three equilibria in a unified manner. By using Bellman-consistent pessimism, we obtain interval estimation for policies’ returns, and use both the upper and the lower bounds to obtain a relaxation on the gap of a candidate policy, which becomes our optimization objective. Our results generalize prior works and provide several additional insights. Importantly, we require a data coverage condition that improves over the recently proposed “unilateral concentrability”. Our condition allows selective coverage of deviation policies that optimally trade-off between their greediness (as approximate best responses) and coverage, and we show scenarios where this leads to significantly better guarantees. As a new connection, we also show how our algorithmic framework can subsume seemingly different solution concepts designed for the special case of two-player zero-sum games.

NeurIPS Conference 2023 Conference Paper

Transformers as Statisticians: Provable In-Context Learning with In-Context Algorithm Selection

  • Yu Bai
  • Fan Chen
  • Huan Wang
  • Caiming Xiong
  • Song Mei

Neural sequence models based on the transformer architecture have demonstrated remarkable \emph{in-context learning} (ICL) abilities, where they can perform new tasks when prompted with training and test examples, without any parameter update to the model. This work first provides a comprehensive statistical theory for transformers to perform ICL. Concretely, we show that transformers can implement a broad class of standard machine learning algorithms in context, such as least squares, ridge regression, Lasso, learning generalized linear models, and gradient descent on two-layer neural networks, with near-optimal predictive power on various in-context data distributions. Using an efficient implementation of in-context gradient descent as the underlying mechanism, our transformer constructions admit mild size bounds, and can be learned with polynomially many pretraining sequences. Building on these ``base'' ICL algorithms, intriguingly, we show that transformers can implement more complex ICL procedures involving \emph{in-context algorithm selection}, akin to what a statistician can do in real life---A \emph{single} transformer can adaptively select different base ICL algorithms---or even perform qualitatively different tasks---on different input sequences, without any explicit prompting of the right algorithm or task. We both establish this in theory by explicit constructions, and also observe this phenomenon experimentally. In theory, we construct two general mechanisms for algorithm selection with concrete examples: pre-ICL testing, and post-ICL validation. As an example, we use the post-ICL validation mechanism to construct a transformer that can perform nearly Bayes-optimal ICL on a challenging task---noisy linear models with mixed noise levels. Experimentally, we demonstrate the strong in-context algorithm selection capabilities of standard transformer architectures.
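
The post-ICL validation mechanism can be illustrated outside any transformer (a toy of our own; the ridge candidates and split sizes are assumptions, not the paper's construction): fit several base algorithms on part of the in-context examples, score each on the held-out rest, and deploy the winner — the adaptive selection the paper shows a single transformer can implement internally.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 5, 60
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_true + 0.1 * rng.normal(size=n)   # a low-noise linear in-context task

# Split the in-context examples into a fit portion and a validation portion.
Xtr, ytr, Xva, yva = X[:40], y[:40], X[40:], y[40:]

def fit_ridge(lam):
    """One 'base ICL algorithm': ridge regression at a fixed regularization."""
    return np.linalg.solve(Xtr.T @ Xtr + lam * np.eye(d), Xtr.T @ ytr)

candidates = {lam: fit_ridge(lam) for lam in (1e-3, 1.0, 100.0)}
val_err = {lam: float(np.mean((Xva @ wh - yva) ** 2))
           for lam, wh in candidates.items()}
best = min(val_err, key=val_err.get)        # the algorithm-selection step
print("selected lambda:", best)
```

On a noisier task the same procedure would select a larger regularization, which is what makes the selection adaptive to the in-context data distribution.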

NeurIPS Conference 2023 Conference Paper

What can a Single Attention Layer Learn? A Study Through the Random Features Lens

  • Hengyu Fu
  • Tianyu Guo
  • Yu Bai
  • Song Mei

Attention layers---which map a sequence of inputs to a sequence of outputs---are core building blocks of the Transformer architecture which has achieved significant breakthroughs in modern artificial intelligence. This paper presents a rigorous theoretical study on the learning and generalization of a single multi-head attention layer, with a sequence of key vectors and a separate query vector as input. We consider the random feature setting where the attention layer has a large number of heads, with randomly sampled frozen query and key matrices, and trainable value matrices. We show that such a random-feature attention layer can express a broad class of target functions that are permutation invariant to the key vectors. We further provide quantitative excess risk bounds for learning these target functions from finite samples, using random feature attention with finitely many heads. Our results feature several implications unique to the attention structure compared with existing random features theory for neural networks, such as (1) Advantages in the sample complexity over standard two-layer random-feature networks; (2) Concrete and natural classes of functions that can be learned efficiently by a random-feature attention layer; and (3) The effect of the sampling distribution of the query-key weight matrix (the product of the query and key matrix), where Gaussian random weights with a non-zero mean result in better sample complexities over the zero-mean counterpart for learning certain natural target functions. Experiments on simulated data corroborate our theoretical findings and further illustrate the interplay between the sample size and the complexity of the target function.
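
The random-feature setting above admits a compact numerical sketch (dimensions, target function, and the ridge fit are our illustrative choices, not the paper's constructions): query-key matrices are frozen at random values with a non-zero mean, each head produces an attention-weighted average of the key vectors, and only a linear readout over the concatenated head outputs is trained.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, M, n_samples = 4, 6, 64, 200    # dim, keys per sequence, heads, samples

# Frozen random query-key matrices, one per head (non-zero mean, per the paper's
# observation that this can help for certain targets).
W = rng.normal(1.0, 1.0, size=(M, d, d))

def head_features(X, q):
    """Per-head attention outputs; only the readout on top of these is trained."""
    feats = []
    for m in range(M):
        logits = X @ (W[m] @ q)                      # score each key against the query
        a = np.exp(logits - logits.max()); a /= a.sum()
        feats.append(a @ X)                          # attention-weighted key average
    return np.concatenate(feats)                     # (M*d,) random features

# Target: a permutation-invariant function of the keys, e.g. the mean of <q, x_i>^2.
Xs = rng.normal(size=(n_samples, N, d))
qs = rng.normal(size=(n_samples, d))
y = np.array([np.mean((X @ q) ** 2) for X, q in zip(Xs, qs)])

Phi = np.array([head_features(X, q) for X, q in zip(Xs, qs)])
lam = 1e-2
v = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M * d), Phi.T @ y)  # trainable part

mse = float(np.mean((Phi @ v - y) ** 2))
print("train MSE:", mse)
```

Note the target is permutation invariant in the key vectors, matching the class of functions the layer is shown to express.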

JBHI Journal 2022 Journal Article

A Scalable Graph-Based Framework for Multi-Organ Histology Image Classification

  • Yu Bai
  • Yue Mi
  • Yihan Su
  • Bo Zhang
  • Zheng Zhang
  • Jingyun Wu
  • Haiwen Huang
  • Yongping Xiong

Graph-based approaches are successful for histology image classification tasks but still face many challenges, such as: 1) the lack of nuclei-level labels and the significant variations between histology images make it extremely difficult to extract discriminative high-level nuclei features like nuclei type, texture and micro-environment; 2) graph-based approaches cannot handle the large-scale cell graph nodes typically contained in histology images; and 3) graph neural networks (GNNs) struggle to learn the long-range dependency of cell graphs. To address the above challenges, we propose a scalable graph-based framework for multi-organ histology image classification. We develop a two-step masked nuclei patches supervised training approach to extract discriminative high-level nuclei features for histology images without nuclei-level labels. Additionally, we introduce a nuclei sampling strategy to make our graph-based framework scalable for large-scale cell graphs. Furthermore, we propose the HierArchical Transformer Graph Neural Network (HAT-Net+) for cell graph classifications. HAT-Net+ adopts Transformer to model the long-range dependency of cell graphs and a parameter-free approach to adaptively fuse different hierarchical graph representations of each layer. We achieved state-of-the-art results on four public histology image classification datasets: CRC dataset (100%), Extended CRC dataset (98%), UZH dataset (96.9%) and BACH dataset (88%). Unlike other methods, our approach can be used in various histology image classification tasks, even for images without nuclei-level labels, indicating its potential in cancer diagnosis. The code is available at https://github.com/suyouooooo/HAT-Net.

NeurIPS Conference 2022 Conference Paper

Efficient Phi-Regret Minimization in Extensive-Form Games via Online Mirror Descent

  • Yu Bai
  • Chi Jin
  • Song Mei
  • Ziang Song
  • Tiancheng Yu

A conceptually appealing approach for learning Extensive-Form Games (EFGs) is to convert them to Normal-Form Games (NFGs). This approach enables us to directly translate state-of-the-art techniques and analyses in NFGs to learning EFGs, but typically suffers from computational intractability due to the exponential blow-up of the game size introduced by the conversion. In this paper, we address this problem in natural and important setups for the \emph{$\Phi$-Hedge} algorithm---A generic algorithm capable of learning a large class of equilibria for NFGs. We show that $\Phi$-Hedge can be directly used to learn Nash Equilibria (zero-sum settings), Normal-Form Coarse Correlated Equilibria (NFCCE), and Extensive-Form Correlated Equilibria (EFCE) in EFGs. We prove that, in those settings, the \emph{$\Phi$-Hedge} algorithms are equivalent to standard Online Mirror Descent (OMD) algorithms for EFGs with suitable dilated regularizers, and run in polynomial time. This new connection further allows us to design and analyze a new class of OMD algorithms based on modifying its log-partition function. In particular, we design an improved algorithm with balancing techniques that achieves a sharp $\widetilde{\mathcal{O}}(\sqrt{XAT})$ EFCE-regret under bandit-feedback in an EFG with $X$ information sets, $A$ actions, and $T$ episodes. To our best knowledge, this is the first such rate and matches the information-theoretic lower bound.
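
In the normal-form special case, $\Phi$-Hedge reduces to ordinary Hedge over actions, which a toy self-play run makes concrete (our own illustration on an invented 2x2 zero-sum game; the paper's contribution is making this tractable on extensive-form games via OMD): with low-regret play by both sides, the *average* strategies form an approximate Nash equilibrium.

```python
import numpy as np

A = np.array([[2.0, -1.0], [-1.0, 1.0]])   # row player's payoff matrix
T = 10000
eta = np.sqrt(np.log(2) / T)               # standard Hedge step size
loss_r, loss_c = np.zeros(2), np.zeros(2)  # cumulative losses per action
avg_p, avg_q = np.zeros(2), np.zeros(2)

for _ in range(T):
    lr = -eta * loss_r
    p = np.exp(lr - lr.max()); p /= p.sum()   # Hedge: exponential weights
    lc = -eta * loss_c
    q = np.exp(lc - lc.max()); q /= q.sum()
    avg_p += p; avg_q += q
    loss_r += -(A @ q)                        # row minimizes negative payoff
    loss_c += A.T @ p                         # column minimizes row's payoff

avg_p /= T; avg_q /= T
# Duality gap of the average strategies; it is 0 at an exact Nash equilibrium
# and is bounded by the sum of the players' average regrets.
gap = float(np.max(A @ avg_q) - np.min(avg_p @ A))
print("average strategies:", avg_p.round(2), avg_q.round(2), "gap:", round(gap, 3))
```

The exponential blow-up the abstract mentions comes from running exactly this scheme on the normal-form conversion of an EFG; the paper's OMD equivalence avoids it.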

NeurIPS Conference 2022 Conference Paper

Identifying good directions to escape the NTK regime and efficiently learn low-degree plus sparse polynomials

  • Eshaan Nichani
  • Yu Bai
  • Jason D. Lee

A recent goal in the theory of deep learning is to identify how neural networks can escape the “lazy training,” or Neural Tangent Kernel (NTK), regime, where the network is coupled with its first order Taylor expansion at initialization. While the NTK is minimax optimal for learning dense polynomials (Ghorbani et al., 2021), it cannot learn features, and hence has poor sample complexity for learning many classes of functions including sparse polynomials. Recent works have thus aimed to identify settings where gradient based algorithms provably generalize better than the NTK. One such example is the “QuadNTK” approach of Bai & Lee (2020), which analyzes the second-order term in the Taylor expansion. Bai & Lee (2020) show that the second-order term can learn sparse polynomials efficiently; however, it sacrifices the ability to learn general dense polynomials. In this paper, we analyze how gradient descent on a two-layer neural network can escape the NTK regime by utilizing a spectral characterization of the NTK (Montanari & Zhong, 2020) and building on the QuadNTK approach. We first expand upon the spectral analysis to identify “good” directions in parameter space in which we can move without harming generalization. Next, we show that a wide two-layer neural network can jointly use the NTK and QuadNTK to fit target functions consisting of a dense low-degree term and a sparse high-degree term -- something neither the NTK nor the QuadNTK can do on their own. Finally, we construct a regularizer which encourages the parameter vector to move in the “good” directions, and show that gradient descent on the regularized loss will converge to a global minimizer, which also has low test error. This yields an end-to-end convergence and generalization guarantee with provable sample complexity improvement over both the NTK and QuadNTK on their own.

NeurIPS Conference 2022 Conference Paper

Policy Optimization for Markov Games: Unified Framework and Faster Convergence

  • Runyu Zhang
  • Qinghua Liu
  • Huan Wang
  • Caiming Xiong
  • Na Li
  • Yu Bai

This paper studies policy optimization algorithms for multi-agent reinforcement learning. We begin by proposing an algorithm framework for two-player zero-sum Markov Games in the full-information setting, where each iteration consists of a policy update step at each state using a certain matrix game algorithm, and a value update step with a certain learning rate. This framework unifies many existing and new policy optimization algorithms. We show that the \emph{state-wise average policy} of this algorithm converges to an approximate Nash equilibrium (NE) of the game, as long as the matrix game algorithms achieve low weighted regret at each state, with respect to weights determined by the speed of the value updates. Next, we show that this framework instantiated with the Optimistic Follow-The-Regularized-Leader (OFTRL) algorithm at each state (and smooth value updates) can find an $\mathcal{\widetilde{O}}(T^{-5/6})$ approximate NE in $T$ iterations, and a similar algorithm with slightly modified value update rule achieves a faster $\mathcal{\widetilde{O}}(T^{-1})$ convergence rate. These improve over the current best $\mathcal{\widetilde{O}}(T^{-1/2})$ rate of symmetric policy optimization type algorithms. We also extend this algorithm to multi-player general-sum Markov Games and show an $\mathcal{\widetilde{O}}(T^{-3/4})$ convergence rate to Coarse Correlated Equilibria (CCE). Finally, we provide a numerical example to verify our theory and investigate the importance of smooth value updates, and find that using ''eager'' value updates instead (equivalent to the independent natural policy gradient algorithm) may significantly slow down the convergence, even on a simple game with $H=2$ layers.

NeurIPS Conference 2022 Conference Paper

Sample-Efficient Learning of Correlated Equilibria in Extensive-Form Games

  • Ziang Song
  • Song Mei
  • Yu Bai

Imperfect-Information Extensive-Form Games (IIEFGs) is a prevalent model for real-world games involving imperfect information and sequential plays. The Extensive-Form Correlated Equilibrium (EFCE) has been proposed as a natural solution concept for multi-player general-sum IIEFGs. However, existing algorithms for finding an EFCE require full feedback from the game, and it remains open how to efficiently learn the EFCE in the more challenging bandit feedback setting where the game can only be learned by observations from repeated playing. This paper presents the first sample-efficient algorithm for learning the EFCE from bandit feedback. We begin by proposing $K$-EFCE---a generalized definition that allows players to observe and deviate from the recommended actions for $K$ times. The $K$-EFCE includes the EFCE as a special case at $K=1$, and is an increasingly stricter notion of equilibrium as $K$ increases. We then design an uncoupled no-regret algorithm that finds an $\varepsilon$-approximate $K$-EFCE within $\widetilde{\mathcal{O}}(\max_{i}X_iA_i^{K}/\varepsilon^2)$ iterations in the full feedback setting, where $X_i$ and $A_i$ are the number of information sets and actions for the $i$-th player. Our algorithm works by minimizing a wide-range regret at each information set that takes into account all possible recommendation histories. Finally, we design a sample-based variant of our algorithm that learns an $\varepsilon$-approximate $K$-EFCE within $\widetilde{\mathcal{O}}(\max_{i}X_iA_i^{K+1}/\varepsilon^2)$ episodes of play in the bandit feedback setting. When specialized to $K=1$, this gives the first sample-efficient algorithm for learning EFCE from bandit feedback.

IJCAI Conference 2022 Conference Paper

Stage-wise Stylistic Headline Generation: Style Generation and Summarized Content Insertion

  • Jiaao Zhan
  • Yang Gao
  • Yu Bai
  • Qianhui Liu

A quality headline with a high click-through rate should not only summarize the content of an article but also reflect a style that attracts users. Such demand has drawn rising attention to the task of stylistic headline generation (SHG). An intuitive method is to first generate plain headlines using document-headline parallel data and then transfer them to a target style. However, this inevitably suffers from error propagation. Therefore, to unify the two sub-tasks and explicitly separate style-relevant attributes from summarized content, we propose an end-to-end stage-wise SHG model containing a style generation component and a content insertion component, where the former generates style-relevant intermediate outputs and the latter receives these outputs and inserts the summarized content. The intermediate outputs are observable, making the style generation easy to control. Our system is comprehensively evaluated by both quantitative and qualitative metrics, and it achieves state-of-the-art results in SHG on three different stylistic datasets.

AAAI Conference 2021 Conference Paper

Exploring Explainable Selection to Control Abstractive Summarization

  • Haonan Wang
  • Yang Gao
  • Yu Bai
  • Mirella Lapata
  • Heyan Huang

Like humans, document summarization models can interpret a document’s contents in a number of ways. Unfortunately, the neural models of today are largely black boxes that provide little explanation of how or why they generated a summary in the way they did. Therefore, to begin prying open the black box and to inject a level of control into the substance of the final summary, we developed a novel select-and-generate framework, ESCA, that focuses on explainability. By revealing the latent centrality and interactions between sentences, along with scores for sentence novelty and relevance, users are given a window into the choices a model is making and an opportunity to guide those choices in a more desirable direction. A novel pair-wise matrix captures the sentence interactions, centrality, and attribute scores, and a mask with tunable attribute thresholds allows the user to control which sentences are likely to be included in the extraction. A sentence-deployed attention mechanism in the abstractor ensures the final summary emphasizes the desired content. Additionally, the encoder is adaptable, supporting both Transformer- and BERT-based configurations. In a series of experiments assessed with ROUGE metrics and two human evaluations, ESCA outperformed eight state-of-the-art models on the CNN/DailyMail and NYT50 benchmark datasets.

NeurIPS Conference 2021 Conference Paper

Near-Optimal Offline Reinforcement Learning via Double Variance Reduction

  • Ming Yin
  • Yu Bai
  • Yu-Xiang Wang

We consider the problem of offline reinforcement learning (RL) --- a well-motivated setting of RL that aims at policy optimization using only historical data. Despite its wide applicability, theoretical understanding of offline RL, such as its optimal sample complexity, remains largely open even in basic settings such as \emph{tabular} Markov Decision Processes (MDPs). In this paper, we propose \emph{Off-Policy Double Variance Reduction} (OPDVR), a new variance reduction-based algorithm for offline RL. Our main result shows that OPDVR provably identifies an $\epsilon$-optimal policy with $\widetilde{O}(H^2/d_m\epsilon^2)$ episodes of offline data in the finite-horizon \emph{stationary transition} setting, where $H$ is the horizon length and $d_m$ is the minimal marginal state-action distribution induced by the behavior policy. This improves over the best-known upper bound by a factor of $H$. Moreover, we establish an information-theoretic lower bound of $\Omega(H^2/d_m\epsilon^2)$ which certifies that OPDVR is optimal up to logarithmic factors. Lastly, we show that OPDVR also achieves rate-optimal sample complexity under alternative settings such as the finite-horizon MDPs with non-stationary transitions and the infinite horizon MDPs with discounted rewards.

NeurIPS Conference 2021 Conference Paper

Policy Finetuning: Bridging Sample-Efficient Offline and Online Reinforcement Learning

  • Tengyang Xie
  • Nan Jiang
  • Huan Wang
  • Caiming Xiong
  • Yu Bai

Recent theoretical work studies sample-efficient reinforcement learning (RL) extensively in two settings: learning interactively in the environment (online RL), or learning from an offline dataset (offline RL). However, existing algorithms and theories for learning near-optimal policies in these two settings are rather different and disconnected. Towards bridging this gap, this paper initiates the theoretical study of *policy finetuning*, that is, online RL where the learner has additional access to a "reference policy" $\mu$ close to the optimal policy $\pi_\star$ in a certain sense. We consider the policy finetuning problem in episodic Markov Decision Processes (MDPs) with $S$ states, $A$ actions, and horizon length $H$. We first design a sharp *offline reduction* algorithm---which simply executes $\mu$ and runs offline policy optimization on the collected dataset---that finds an $\varepsilon$ near-optimal policy within $\widetilde{O}(H^3SC^\star/\varepsilon^2)$ episodes, where $C^\star$ is the single-policy concentrability coefficient between $\mu$ and $\pi_\star$. This offline result is the first that matches the sample complexity lower bound in this setting, and resolves a recent open question in offline RL. We then establish an $\Omega(H^3S\min\{C^\star, A\}/\varepsilon^2)$ sample complexity lower bound for *any* policy finetuning algorithm, including those that can adaptively explore the environment. This implies that---perhaps surprisingly---the optimal policy finetuning algorithm is either offline reduction or a purely online RL algorithm that does not use $\mu$. Finally, we design a new hybrid offline/online algorithm for policy finetuning that achieves better sample complexity than both vanilla offline reduction and purely online RL algorithms, in a relaxed setting where $\mu$ only satisfies concentrability partially up to a certain time step. 
Overall, our results offer a quantitative understanding of the benefit of a good reference policy, and take a step towards bridging offline and online RL.
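The offline reduction in the first result has a simple shape: execute the reference policy to collect episodes, then run offline planning on the empirical model. The toy below (its MDP, rewards, uniform reference policy, and episode budget are all invented) sketches that shape; it is not the paper's algorithm or analysis.

```python
import random
random.seed(0)

S, A, H = 2, 2, 3
# Invented tabular MDP: P[(s, a)] is the distribution over next states,
# R[(s, a)] a deterministic reward; action 1 steers toward state 1, which pays off.
P = {(0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],
     (1, 0): [0.5, 0.5], (1, 1): [0.1, 0.9]}
R = {(0, 0): 0.1, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 1.0}

def step(s, a):
    s2 = 0 if random.random() < P[(s, a)][0] else 1
    return s2, R[(s, a)]

# Step 1: execute the reference policy mu (uniform here) and log the data.
counts, rew_sum, trans = {}, {}, {}
for _ in range(5000):
    s = 0
    for _ in range(H):
        a = random.randrange(A)
        s2, r = step(s, a)
        counts[(s, a)] = counts.get((s, a), 0) + 1
        rew_sum[(s, a)] = rew_sum.get((s, a), 0.0) + r
        trans[(s, a, s2)] = trans.get((s, a, s2), 0) + 1
        s = s2

# Step 2: offline planning -- backward value iteration in the empirical model.
V = [0.0] * S
pi = [[0] * S for _ in range(H)]
for h in reversed(range(H)):
    Vn = [0.0] * S
    for s in range(S):
        best, best_a = float("-inf"), 0
        for a in range(A):
            nsa = counts.get((s, a), 0)
            if nsa == 0:
                q = 0.0  # pessimistic default for unvisited pairs
            else:
                q = rew_sum[(s, a)] / nsa + sum(
                    trans.get((s, a, s2), 0) / nsa * V[s2] for s2 in range(S))
            if q > best:
                best, best_a = q, a
        Vn[s], pi[h][s] = best, best_a
    V = Vn

print(pi)  # greedy policy: steer toward state 1 early, then collect its reward
```

The single-policy concentrability coefficient $C^\star$ of the paper measures how well such data covers the optimal policy; a uniform reference policy in this tiny MDP covers everything, which is why plain empirical planning suffices here.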

NeurIPS Conference 2021 Conference Paper

Sample-Efficient Learning of Stackelberg Equilibria in General-Sum Games

  • Yu Bai
  • Chi Jin
  • Huan Wang
  • Caiming Xiong

Real-world applications such as economics and policy making often involve solving multi-agent games with two unique features: (1) the agents are inherently asymmetric and partitioned into leaders and followers; (2) the agents have different reward functions, so the game is general-sum. The majority of existing results in this field focus on either symmetric solution concepts (e.g., Nash equilibrium) or zero-sum games. It remains open how to learn the Stackelberg equilibrium---an asymmetric analog of the Nash equilibrium---in general-sum games efficiently from noisy samples. This paper initiates the theoretical study of sample-efficient learning of the Stackelberg equilibrium in the bandit feedback setting, where we only observe noisy samples of the reward. We consider three representative two-player general-sum games: bandit games, bandit-reinforcement learning (bandit-RL) games, and linear bandit games. In all these games, we identify a fundamental gap between the exact value of the Stackelberg equilibrium and its estimated version using finitely many noisy samples, which cannot be closed information-theoretically regardless of the algorithm. We then establish sharp positive results on sample-efficient learning of the Stackelberg equilibrium with value optimal up to the gap identified above, with matching lower bounds in the dependency on the gap, the error tolerance, and the size of the action spaces. Overall, our results unveil unique challenges in learning Stackelberg equilibria under noisy bandit feedback, which we hope will shed light on future research on this topic.
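In the simplest of these settings (bandit games), the estimation-then-commitment idea can be sketched as follows. The payoff tables, noise level, and sample count are invented, and this toy ignores the follower tie-breaking subtleties behind the paper's fundamental gap.

```python
import random
random.seed(0)

# Invented 2x2 general-sum game: uL = leader's mean rewards, uF = follower's,
# indexed [leader action][follower action].
uL = [[0.9, 0.1], [0.4, 0.6]]
uF = [[0.7, 0.3], [0.2, 0.8]]
N, sigma = 2000, 0.5

# Estimate every cell's mean reward from noisy bandit samples.
est_L = [[0.0, 0.0] for _ in range(2)]
est_F = [[0.0, 0.0] for _ in range(2)]
for a in range(2):
    for b in range(2):
        for _ in range(N):
            est_L[a][b] += random.gauss(uL[a][b], sigma) / N
            est_F[a][b] += random.gauss(uF[a][b], sigma) / N

# Follower best-responds to its own estimated rewards given the leader's action;
# the leader commits to the action maximizing its value under that response.
br = [max(range(2), key=lambda b: est_F[a][b]) for a in range(2)]
leader = max(range(2), key=lambda a: est_L[a][br[a]])
print(leader, br[leader])  # with these gaps and sample sizes: action 0, response 0
```

The paper's lower bounds concern exactly the regime this toy sidesteps: when the follower's two rewards are nearly tied, no finite number of noisy samples can reliably identify the best response, creating an unavoidable gap in the learned Stackelberg value.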

NeurIPS Conference 2021 Conference Paper

Understanding the Under-Coverage Bias in Uncertainty Estimation

  • Yu Bai
  • Song Mei
  • Huan Wang
  • Caiming Xiong

Estimating the data uncertainty in regression tasks is often done by learning a quantile function or a prediction interval of the true label conditioned on the input. It is frequently observed that quantile regression---a vanilla algorithm for learning quantiles with asymptotic guarantees---tends to *under-cover* relative to the desired coverage level in practice. While various fixes have been proposed, a more fundamental understanding of why this under-coverage bias happens in the first place remains elusive. In this paper, we present a rigorous theoretical study on the coverage of uncertainty estimation algorithms in learning quantiles. We prove that quantile regression suffers from an inherent under-coverage bias, in a vanilla setting where we learn a realizable linear quantile function and there is more data than parameters. More quantitatively, for $\alpha>0.5$ and small $d/n$, the $\alpha$-quantile learned by quantile regression roughly achieves coverage $\alpha - (\alpha-1/2)\cdot d/n$ regardless of the noise distribution, where $d$ is the input dimension and $n$ is the number of training data points. Our theory reveals that this under-coverage bias stems from a certain high-dimensional parameter estimation error that is not implied by existing theories on quantile regression. Experiments on simulated and real data verify our theory and further illustrate the effect of various factors, such as sample size and model capacity, on the under-coverage bias in more practical setups.
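The under-coverage prediction is easy to probe in simulation. The sketch below fits a linear quantile function by subgradient descent on the pinball loss; the dimensions, sample sizes, step-size schedule, and the shifted-Gaussian noise (1.2816 approximates the standard normal 0.9-quantile) are our own choices for illustration, not the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, d, n = 0.9, 20, 100

# Realizable linear setting: the noise has alpha-quantile zero, so the true
# alpha-quantile of y given x is exactly x @ w_star (no intercept needed).
w_star = rng.standard_normal(d)
def sample(m):
    X = rng.standard_normal((m, d)) / np.sqrt(d)
    y = X @ w_star + rng.standard_normal(m) - 1.2816
    return X, y

# Fit quantile regression: subgradient descent on the pinball (check) loss,
# whose subgradient in w is -alpha * x on under-predictions, (1-alpha) * x else.
X, y = sample(n)
w = np.zeros(d)
for t in range(1, 20001):
    r = y - X @ w
    g = X.T @ np.where(r > 0, -alpha, 1 - alpha) / n
    w -= 0.5 / np.sqrt(t) * g

# Coverage on fresh data: the abstract's formula suggests roughly
# alpha - (alpha - 1/2) * d / n = 0.82 here, i.e. below the nominal 0.9.
Xt, yt = sample(20000)
coverage = float((yt <= Xt @ w).mean())
print(coverage)
```

Note the under-coverage appears even though the model class is correctly specified and $n > d$; per the abstract, it is an estimation-error effect, not a misspecification effect.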

NeurIPS Conference 2020 Conference Paper

Near-Optimal Reinforcement Learning with Self-Play

  • Yu Bai
  • Chi Jin
  • Tiancheng Yu

This paper considers the problem of designing optimal algorithms for reinforcement learning in two-player zero-sum games. We focus on self-play algorithms which learn the optimal policy by playing against themselves without any direct supervision. In a tabular episodic Markov game with $S$ states, $A$ max-player actions, and $B$ min-player actions, the best existing algorithm for finding an approximate Nash equilibrium requires $\widetilde{O}(S^2AB)$ steps of game playing, when only highlighting the dependency on $(S, A, B)$. In contrast, the best existing lower bound scales as $\Omega(S(A+B))$ and has a significant gap from the upper bound. This paper closes this gap for the first time: we propose an optimistic variant of the Nash Q-learning algorithm with sample complexity $\widetilde{O}(SAB)$, and a new Nash V-learning algorithm with sample complexity $\widetilde{O}(S(A+B))$. The latter result matches the information-theoretic lower bound in all problem-dependent parameters except for a polynomial factor of the length of each episode. In addition, we present a computational hardness result for learning the best responses against a fixed opponent in Markov games---a learning objective different from finding the Nash equilibrium.

NeurIPS Conference 2020 Conference Paper

Towards Understanding Hierarchical Learning: Benefits of Neural Representations

  • Minshuo Chen
  • Yu Bai
  • Jason D. Lee
  • Tuo Zhao
  • Huan Wang
  • Caiming Xiong
  • Richard Socher

Deep neural networks can empirically perform efficient hierarchical learning, in which the layers learn useful representations of the data. However, how they make use of the intermediate representations is not explained by recent theories that relate them to "shallow learners" such as kernels. In this work, we demonstrate that intermediate \emph{neural representations} add more flexibility to neural networks and can be advantageous over raw inputs. We consider a fixed, randomly initialized neural network as a representation function fed into another trainable network. When the trainable network is the quadratic Taylor model of a wide two-layer network, we show that neural representations can achieve improved sample complexity compared with the raw input: for learning a low-rank degree-$p$ polynomial ($p \geq 4$) in $d$ dimensions, neural representations require only $\widetilde{O}(d^{\lceil p/2 \rceil})$ samples, while the best-known sample complexity upper bound for the raw input is $\widetilde{O}(d^{p-1})$. We contrast our result with a lower bound showing that neural representations do not improve over the raw input (in the infinite width limit) when the trainable network is instead a neural tangent kernel. Our results characterize when neural representations are beneficial, and may provide a new perspective on why depth is important in deep learning.
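A scaled-down version of this setup (with an ordinary ridge-regression head standing in for the paper's quadratic Taylor model, and all sizes invented) contrasts a trainable head on the raw input with the same head on a fixed random-ReLU representation, for a low-rank quartic target:

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, n_train, n_test = 10, 500, 2000, 2000

# Low-rank degree-4 target: y = (v . x)^4 depends on a single direction v,
# so a linear model on the raw input can fit essentially nothing beyond the mean.
v = rng.standard_normal(d)
v /= np.linalg.norm(v)
def make(n):
    X = rng.standard_normal((n, d))
    return X, (X @ v) ** 4

# Fixed, randomly initialized one-layer network used purely as a representation.
W = rng.standard_normal((d, m)) / np.sqrt(d)
b = rng.standard_normal(m)
feats = lambda X: np.maximum(X @ W + b, 0.0)

def ridge_fit(F, y, lam=1.0):
    Fb = np.hstack([F, np.ones((len(F), 1))])  # append an intercept column
    return np.linalg.solve(Fb.T @ Fb + lam * np.eye(Fb.shape[1]), Fb.T @ y)

def ridge_mse(F, y, w):
    Fb = np.hstack([F, np.ones((len(F), 1))])
    return float(np.mean((Fb @ w - y) ** 2))

Xtr, ytr = make(n_train)
Xte, yte = make(n_test)
mse_raw = ridge_mse(Xte, yte, ridge_fit(Xtr, ytr))                # head on raw input
mse_rep = ridge_mse(feats(Xte), yte, ridge_fit(feats(Xtr), ytr))  # head on representation
print(mse_raw, mse_rep)  # the representation head achieves lower test error
```

This is only a caricature of the theorem (a linear rather than quadratic head, and no sample-complexity scan over $d$), but it shows the qualitative claim: the fixed random representation turns a target outside the raw model's reach into one the trainable head can partially fit.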

NeurIPS Conference 2019 Conference Paper

Provably Efficient Q-Learning with Low Switching Cost

  • Yu Bai
  • Tengyang Xie
  • Nan Jiang
  • Yu-Xiang Wang

We take initial steps in studying PAC-MDP algorithms with limited adaptivity, that is, algorithms that change their exploration policy as infrequently as possible during regret minimization. This is motivated by the difficulty of running fully adaptive algorithms in real-world applications (such as medical domains), and we propose to quantify adaptivity using the notion of \emph{local switching cost}. Our main contribution, Q-Learning with UCB2 exploration, is a model-free algorithm for $H$-step episodic MDPs that achieves sublinear regret and whose local switching cost in $K$ episodes is $O(H^3SA\log K)$; we also provide a lower bound of $\Omega(HSA)$ on the local switching cost for any no-regret algorithm. Our algorithm can be naturally adapted to the concurrent setting \citep{guo2015concurrent}, which yields nontrivial results that improve upon prior work in certain aspects.
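The flavor of limited adaptivity can be shown in a toy two-armed bandit where the played arm is recomputed only at UCB2-style doubling epochs; the arm means, horizon, and confidence bonus below are invented, and this is a caricature of the $\log K$ switching idea rather than the paper's episodic-MDP algorithm.

```python
import math, random
random.seed(0)

means = [0.5, 0.6]      # two invented Bernoulli arms; arm 1 is better
K = 10000               # total number of pulls
count, total = [0, 0], [0.0, 0.0]
committed, switches, t = None, 0, 0

while t < K:
    # Recompute the played arm only at epoch boundaries; committing until the
    # chosen arm's count doubles keeps the number of epochs O(log K).
    def ucb(a):
        if count[a] == 0:
            return float("inf")
        return total[a] / count[a] + math.sqrt(2 * math.log(t + 1) / count[a])
    arm = max(range(2), key=ucb)
    if arm != committed:
        switches += 1
        committed = arm
    for _ in range(max(1, count[arm])):   # play until this arm's count doubles
        if t >= K:
            break
        r = 1.0 if random.random() < means[arm] else 0.0
        count[arm] += 1
        total[arm] += r
        t += 1

print(switches, count)  # switches grow like log K, not like K
```

A fully adaptive UCB learner could in principle switch arms on almost every pull; the doubling commitment bounds the number of epochs (hence switches) by roughly $2\log_2 K$, the one-state analogue of the $O(H^3SA\log K)$ local switching cost in the abstract.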