Arrow Research search

Author name cluster

Xiangyu Liu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

20 papers
2 author rows

Possible papers

20

AAAI Conference 2026 Conference Paper

Answering the Unanswerable Is to Err Knowingly: Analyzing and Mitigating Abstention Failures in Large Reasoning Models

  • Yi Liu
  • Xiangyu Liu
  • Zequn Sun
  • Wei Hu

Large reasoning models (LRMs) have shown remarkable progress on complex reasoning tasks. However, some questions posed to LRMs are inherently unanswerable, such as math problems lacking sufficient conditions. We find that LRMs continually fail to provide appropriate abstentions when confronted with these unanswerable questions. In this paper, we systematically analyze, investigate, and resolve this issue for trustworthy AI. We first conduct a detailed analysis of the distinct response behaviors of LRMs when facing unanswerable questions. Then, we show that LRMs possess sufficient cognitive capabilities to recognize the flaws in these questions. However, they fail to exhibit appropriate abstention behavior, revealing a misalignment between their internal cognition and external response. Finally, to resolve this issue, we propose a lightweight, two-stage method that combines cognitive monitoring with inference-time intervention. Experimental results demonstrate that our method significantly improves the abstention rate while maintaining the reasoning performance.

AAAI Conference 2026 Conference Paper

ProtSAE: Disentangling and Interpreting Protein Language Models via Semantically-Guided Sparse Autoencoders

  • Xiangyu Liu
  • Haodi Lei
  • Yi Liu
  • Yang Liu
  • Wei Hu

Sparse Autoencoder (SAE) has emerged as a powerful tool for mechanistic interpretability of large language models. Recent works apply SAE to protein language models (PLMs), aiming to extract and analyze biologically meaningful features from their latent spaces. However, SAE suffers from semantic entanglement, where individual neurons often mix multiple nonlinear concepts, making it difficult to reliably interpret or manipulate model behaviors. In this paper, we propose a semantically-guided SAE, called ProtSAE. Unlike existing SAEs, which require annotation datasets to filter and interpret activations, we guide semantic disentanglement during training using both annotation datasets and domain knowledge to mitigate the effects of entangled attributes. We design interpretability experiments showing that ProtSAE learns more biologically relevant and interpretable hidden features compared to previous methods. Performance analyses further demonstrate that ProtSAE maintains high reconstruction fidelity while achieving better results in interpretable probing. We also show the potential of ProtSAE in steering PLMs for downstream generation tasks.
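The abstract does not give ProtSAE's architecture; for context, a minimal sketch of the plain SAE objective it builds on, reconstruction error plus an L1 sparsity penalty on an overcomplete latent code. The dimensions `d` and `m` and the random weights are illustrative assumptions, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# Hypothetical sizes: d-dim model activations, m-dim (overcomplete) SAE latent.
d, m = 8, 32
W_enc = rng.normal(scale=0.1, size=(d, m))
b_enc = np.zeros(m)
W_dec = rng.normal(scale=0.1, size=(m, d))
b_dec = np.zeros(d)

def sae_forward(x, l1_coeff=1e-3):
    """One SAE forward pass: sparse code z, reconstruction x_hat, training loss."""
    z = relu(x @ W_enc + b_enc)          # sparse latent features (one per "neuron")
    x_hat = z @ W_dec + b_dec            # reconstruction of the input activation
    recon = np.mean((x - x_hat) ** 2)    # reconstruction-fidelity term
    sparsity = np.sum(np.abs(z))         # L1 penalty pushes most features to zero
    return z, x_hat, recon + l1_coeff * sparsity

x = rng.normal(size=d)                   # stand-in for a PLM activation vector
z, x_hat, loss = sae_forward(x)
```

ProtSAE's contribution, per the abstract, is adding semantic guidance from annotations and domain knowledge during training of such a model, which this sketch does not attempt to reproduce.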

AAAI Conference 2025 Conference Paper

Controllable Protein Sequence Generation with LLM Preference Optimization

  • Xiangyu Liu
  • Yi Liu
  • Silei Chen
  • Wei Hu

Designing proteins with specific attributes offers an important solution to address biomedical challenges. Pre-trained protein large language models (LLMs) have shown promising results on protein sequence generation. However, to control sequence generation for specific attributes, existing work still exhibits poor functionality and structural stability. In this paper, we propose a novel controllable protein design method called CtrlProt. We finetune a protein LLM with a new multi-listwise preference optimization strategy to improve generation quality and support multi-attribute controllable generation. Experiments demonstrate that CtrlProt can meet functionality and structural stability requirements effectively, achieving state-of-the-art performance in both single-attribute and multi-attribute protein sequence generation.

ICLR Conference 2025 Conference Paper

Do LLM Agents Have Regret? A Case Study in Online Learning and Games

  • Chanwoo Park
  • Xiangyu Liu
  • Asuman E. Ozdaglar
  • Kaiqing Zhang

Large language models (LLMs) have been increasingly employed for (interactive) decision-making, via the development of LLM-based autonomous agents. Despite their emerging successes, the performance of LLM agents in decision-making has not been fully investigated through quantitative metrics, especially in the multi-agent setting when they interact with each other, a typical scenario in real-world LLM-agent applications. To better understand the limits of LLM agents in these interactive environments, we propose to study their interactions in benchmark decision-making settings in online learning and game theory, through the performance metric of regret. We first empirically study the no-regret behaviors of LLMs in canonical non-stochastic online learning problems, as well as the emergence of equilibria when LLM agents interact through playing repeated games. We then provide some theoretical insights into the no-regret behaviors of LLM agents, under certain assumptions on the supervised pre-training and the rationality model of human decision-makers who generate the data. Notably, we also identify (simple) cases where advanced LLMs such as GPT-4 fail to be no-regret. To further promote the no-regret behaviors, we propose a novel unsupervised training loss of regret-loss, which, in contrast to the supervised pre-training loss, does not require the labels of (optimal) actions. Finally, we establish the statistical guarantee of generalization bound for regret-loss minimization, and more importantly, the optimization guarantee that minimizing such a loss may automatically lead to known no-regret learning algorithms, when single-layer self-attention models are used. Our further experiments demonstrate the effectiveness of our regret-loss, especially in addressing the above “regrettable” cases.
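The regret metric used here is the standard online-learning notion: the gap between the cumulative reward of the best fixed action in hindsight and that of the actions actually played,

```latex
\mathrm{Regret}_T \;=\; \max_{a \in \mathcal{A}} \sum_{t=1}^{T} r_t(a) \;-\; \sum_{t=1}^{T} r_t(a_t),
```

and an agent is said to be no-regret when $\mathrm{Regret}_T = o(T)$, i.e., its average per-round regret vanishes as $T \to \infty$.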

AAAI Conference 2025 Conference Paper

Is Poisoning a Real Threat to DPO? Maybe More So Than You Think

  • Pankayaraj Pathmanathan
  • Souradip Chakraborty
  • Xiangyu Liu
  • Yongyuan Liang
  • Furong Huang

Recent advancements in Reinforcement Learning with Human Feedback (RLHF) have significantly impacted the alignment of Large Language Models (LLMs). The sensitivity of reinforcement learning algorithms such as Proximal Policy Optimization (PPO) has led to a new line of work on Direct Preference Optimization (DPO), which treats RLHF in a supervised learning framework. The increased practical use of these RLHF methods warrants an analysis of their vulnerabilities. In this work, we investigate the vulnerabilities of DPO to poisoning attacks under different scenarios and compare the effectiveness of preference poisoning, a first of its kind. We comprehensively analyze DPO's vulnerabilities under different types of attacks, i.e., backdoor and non-backdoor attacks, and different poisoning methods across a wide array of language models, i.e., LLaMA 7B, Mistral 7B, and Gemma 7B. We find that unlike PPO-based methods, which require at least 4% of the data to be poisoned to elicit harmful behavior via backdoor attacks, DPO can be exploited with simpler methods that poison the model with as little as 0.5% of the data. We further investigate the efficacy of existing defense methods and find that these poisoning attacks can evade existing data anomaly detection methods.
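For context, the standard DPO objective that such preference-poisoning attacks target can be sketched as below. The log-probability values and the `beta` setting are illustrative assumptions; a poisoned pair simply swaps which response is labeled as preferred:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) preference pair.

    logp_w / logp_l are the policy's total log-probs of the chosen and rejected
    responses; ref_logp_* are the frozen reference model's. Flipping the
    preference label corresponds to swapping the (w, l) roles.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -np.log(sigmoid(margin))

# Hypothetical numbers: a clean pair vs. the same pair with the label flipped.
clean = dpo_loss(logp_w=-5.0, logp_l=-9.0, ref_logp_w=-6.0, ref_logp_l=-8.0)
flipped = dpo_loss(logp_w=-9.0, logp_l=-5.0, ref_logp_w=-8.0, ref_logp_l=-6.0)
```

Because DPO trains directly on such labeled pairs with a plain supervised loss, a small fraction of flipped labels feeds gradients toward the rejected response, which is consistent with the low poisoning budgets the abstract reports.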

NeurIPS Conference 2025 Conference Paper

RAGRouter: Learning to Route Queries to Multiple Retrieval-Augmented Language Models

  • Jiarui Zhang
  • Xiangyu Liu
  • Yong Hu
  • Chaoyue Niu
  • Fan Wu
  • Guihai Chen

Retrieval-Augmented Generation (RAG) significantly improves the performance of Large Language Models (LLMs) on knowledge-intensive tasks. However, varying response quality across LLMs under RAG necessitates intelligent routing mechanisms, which select the most suitable model for each query from multiple retrieval-augmented LLMs via a dedicated router model. We observe that external documents dynamically affect LLMs' ability to answer queries, while existing routing methods, which rely on static parametric knowledge representations, exhibit suboptimal performance in RAG scenarios. To address this, we formally define the new retrieval-augmented LLM routing problem, incorporating the influence of retrieved documents into the routing framework. We propose RAGRouter, a RAG-aware routing design, which leverages document embeddings and RAG capability embeddings with contrastive learning to capture knowledge representation shifts and enable informed routing decisions. Extensive experiments on diverse knowledge-intensive tasks and retrieval settings, covering open and closed-source LLMs, show that RAGRouter outperforms the best individual LLM and existing routing methods. With an extended score-threshold-based mechanism, it also achieves strong performance-efficiency trade-offs under low-latency constraints. The code and data are available at https://github.com/OwwO99/RAGRouter.

ICLR Conference 2024 Conference Paper

Beyond Worst-case Attacks: Robust RL with Adaptive Defense via Non-dominated Policies

  • Xiangyu Liu
  • Chenghao Deng
  • Yanchao Sun
  • Yongyuan Liang
  • Furong Huang

In light of the burgeoning success of reinforcement learning (RL) in diverse real-world applications, considerable focus has been directed towards ensuring RL policies are robust to adversarial attacks during test time. Current approaches largely revolve around solving a minimax problem to prepare for potential worst-case scenarios. While effective against strong attacks, these methods often compromise performance in the absence of attacks or the presence of only weak attacks. To address this, we study policy robustness under the well-accepted state-adversarial attack model, extending our focus beyond merely worst-case attacks. We first formalize this task at test time as a regret minimization problem and establish its intrinsic difficulty in achieving sublinear regret when the baseline policy is from a general continuous policy class, $\Pi$. This finding prompts us to \textit{refine} the baseline policy class $\Pi$ prior to test time, aiming for efficient adaptation within a compact, finite policy class $\tilde{\Pi}$, which can resort to an adversarial bandit subroutine. In light of the importance of a finite and compact $\tilde{\Pi}$, we propose a novel training-time algorithm to iteratively discover \textit{non-dominated policies}, forming a near-optimal and minimal $\tilde{\Pi}$, thereby ensuring both robustness and test-time efficiency. Empirical validation on MuJoCo corroborates the superiority of our approach in terms of natural and robust performance, as well as adaptability to various attack scenarios.

ICLR Conference 2024 Conference Paper

Game-Theoretic Robust Reinforcement Learning Handles Temporally-Coupled Perturbations

  • Yongyuan Liang
  • Yanchao Sun
  • Ruijie Zheng
  • Xiangyu Liu
  • Benjamin Eysenbach
  • Tuomas Sandholm
  • Furong Huang
  • Stephen Marcus McAleer

Deploying reinforcement learning (RL) systems requires robustness to uncertainty and model misspecification, yet prior robust RL methods typically only study noise introduced independently across time. However, practical sources of uncertainty are usually coupled across time. We formally introduce temporally-coupled perturbations, presenting a novel challenge for existing robust RL methods. To tackle this challenge, we propose GRAD, a novel game-theoretic approach that treats the temporally-coupled robust RL problem as a partially-observable two-player zero-sum game. By finding an approximate equilibrium within this game, GRAD optimizes for general robustness against temporally-coupled perturbations. Experiments on continuous control tasks demonstrate that, compared with prior methods, our approach achieves a higher degree of robustness to various types of attacks on different attack domains, both in settings with temporally-coupled perturbations and decoupled perturbations.

AAAI Conference 2024 Conference Paper

Knowledge Graph Error Detection with Contrastive Confidence Adaption

  • Xiangyu Liu
  • Yang Liu
  • Wei Hu

Knowledge graphs (KGs) often contain various errors. Previous works on detecting errors in KGs mainly rely on triplet embedding from graph structure. We conduct an empirical study and find that these works struggle to discriminate noise from semantically-similar correct triplets. In this paper, we propose a KG error detection model CCA to integrate both textual and graph structural information from triplet reconstruction for better distinguishing semantics. We design interactive contrastive learning to capture the differences between textual and structural patterns. Furthermore, we construct realistic datasets with semantically-similar noise and adversarial noise. Experimental results demonstrate that CCA outperforms state-of-the-art baselines, especially on semantically-similar noise and adversarial noise.

NeurIPS Conference 2024 Conference Paper

Provable Partially Observable Reinforcement Learning with Privileged Information

  • Yang Cai
  • Xiangyu Liu
  • Argyris Oikonomou
  • Kaiqing Zhang

Partial observability of the underlying states generally presents significant challenges for reinforcement learning (RL). In practice, certain privileged information, e.g., the access to states from simulators, has been exploited in training and achieved prominent empirical successes. To better understand the benefits of privileged information, we revisit and examine several simple and practically used paradigms in this setting, with both computation and sample efficiency analyses. Specifically, we first formalize the empirical paradigm of expert distillation (also known as teacher-student learning), demonstrating its pitfall in finding near-optimal policies. We then identify a condition of the partially observable environment, the deterministic filter condition, under which expert distillation achieves sample and computational complexities that are both polynomial. Furthermore, we investigate another successful empirical paradigm of asymmetric actor-critic, and focus on the more challenging setting of observable partially observable Markov decision processes. We develop a belief-weighted optimistic asymmetric actor-critic algorithm with polynomial sample and quasi-polynomial computational complexities, where one key component is a new provable oracle for learning belief states that preserve filter stability under a misspecified model, which may be of independent interest. Finally, we also investigate the provable efficiency of partially observable multi-agent RL (MARL) with privileged information. We develop algorithms with the feature of centralized-training-with-decentralized-execution, a popular framework in empirical MARL, with polynomial sample and (quasi-)polynomial computational complexity in both paradigms above. Compared with a few recent related theoretical studies, our focus is on understanding practically inspired algorithmic paradigms, without computationally intractable oracles.

ICLR Conference 2024 Conference Paper

Rethinking Adversarial Policies: A Generalized Attack Formulation and Provable Defense in RL

  • Xiangyu Liu
  • Souradip Chakraborty
  • Yanchao Sun
  • Furong Huang

Most existing works focus on direct perturbations to the victim's state/action or the underlying transition dynamics to demonstrate the vulnerability of reinforcement learning agents to adversarial attacks. However, such direct manipulations may not be always realizable. In this paper, we consider a multi-agent setting where a well-trained victim agent $\nu$ is exploited by an attacker controlling another agent $\alpha$ with an \textit{adversarial policy}. Previous models do not account for the possibility that the attacker may only have partial control over $\alpha$ or that the attack may produce easily detectable ``abnormal'' behaviors. Furthermore, there is a lack of provably efficient defenses against these adversarial policies. To address these limitations, we introduce a generalized attack framework that has the flexibility to model to what extent the adversary is able to control the agent, and allows the attacker to regulate the state distribution shift and produce stealthier adversarial policies. Moreover, we offer a provably efficient defense with polynomial convergence to the most robust victim policy through adversarial training with timescale separation. This stands in sharp contrast to supervised learning, where adversarial training typically provides only \textit{empirical} defenses. Using the Robosumo competition experiments, we show that our generalized attack formulation results in much stealthier adversarial policies when maintaining the same winning rate as baselines. Additionally, our adversarial training approach yields stable learning dynamics and less exploitable victim policies.

ICRA Conference 2024 Conference Paper

Weathering Ongoing Uncertainty: Learning and Planning in a Time-Varying Partially Observable Environment

  • Gokul Puthumanaillam
  • Xiangyu Liu
  • Negar Mehr
  • Melkior Ornik

Optimal decision-making presents a significant challenge for autonomous systems operating in uncertain, stochastic and time-varying environments. Environmental variability over time can significantly impact the system’s optimal decision making strategy for mission completion. To model such environments, our work combines the previous notion of Time-Varying Markov Decision Processes (TVMDP) with partial observability and introduces Time-Varying Partially Observable Markov Decision Processes (TV-POMDP). We propose a two-pronged approach to accurately estimate and plan within the TV-POMDP: 1) Memory Prioritized State Estimation (MPSE), which leverages weighted memory to provide more accurate time-varying transition estimates; and 2) an MPSE-integrated planning strategy that optimizes long-term rewards while accounting for temporal constraints. We validate the proposed framework and algorithms using simulations and hardware, with robots exploring a partially observable, time-varying environment. Our results demonstrate superior performance over standard methods, highlighting the framework’s effectiveness in stochastic, uncertain, time-varying domains.

ICLR Conference 2023 Conference Paper

Bayes-MIL: A New Probabilistic Perspective on Attention-based Multiple Instance Learning for Whole Slide Images

  • Yufei Cui
  • Ziquan Liu
  • Xiangyu Liu
  • Xue Liu 0001
  • Cong Wang 0001
  • Tei-Wei Kuo
  • Chun Jason Xue
  • Antoni B. Chan

Multiple instance learning (MIL) is a popular weakly-supervised learning model on the whole slide image (WSI) for AI-assisted pathology diagnosis. The recent advance in attention-based MIL allows the model to find its region-of-interest (ROI) for interpretation by learning the attention weights for image patches of WSI slides. However, we empirically find that the interpretability of some related methods is either untrustworthy as the principle of MIL is violated or unsatisfactory as the high-attention regions are not consistent with experts' annotations. In this paper, we propose Bayes-MIL to address the problem from a probabilistic perspective. The induced patch-level uncertainty is proposed as a new measure of MIL interpretability, which outperforms previous methods in matching doctors' annotations. We design a slide-dependent patch regularizer (SDPR) for the attention, imposing constraints derived from the MIL assumption on the attention distribution. SDPR explicitly constrains the model to generate correct attention values. The spatial information is further encoded by an approximate convolutional conditional random field (CRF), for better interpretability. Experimental results show Bayes-MIL outperforms the related methods in patch-level and slide-level metrics and provides much better interpretable ROI on several large-scale WSI datasets.

JBHI Journal 2023 Journal Article

Development of Prognostic Biomarkers by TMB-Guided WSI Analysis: A Two-Step Approach

  • Xiangyu Liu
  • Zhenyu Liu
  • Ye Yan
  • Kai Wang
  • Aodi Wang
  • Xiongjun Ye
  • Liwei Wang
  • Wei Wei

The rapid development of computational pathology has brought new opportunities for prognosis prediction using histopathological images. However, the existing deep learning frameworks lack exploration of the relationship between images and other prognostic information, resulting in poor interpretability. Tumor mutation burden (TMB) is a promising biomarker for predicting the survival outcomes of cancer patients, but its measurement is costly. Its heterogeneity may be reflected in histopathological images. Here, we report a two-step framework for prognostic prediction using whole-slide images (WSIs). First, the framework adopts a deep residual network to encode the phenotype of WSIs and classifies patient-level TMB by the deep features after aggregation and dimensionality reduction. Then, the patients' prognosis is stratified by the TMB-related information obtained during the classification model development. Deep learning feature extraction and TMB classification model construction are performed on an in-house dataset of 295 Haematoxylin & Eosin stained WSIs of clear cell renal cell carcinoma (ccRCC). The development and evaluation of prognostic biomarkers are performed on The Cancer Genome Atlas-Kidney ccRCC (TCGA-KIRC) project with 304 WSIs. Our framework achieves good performance for TMB classification with an area under the receiver operating characteristic curve (AUC) of 0.813 on the validation set. Through survival analysis, our proposed prognostic biomarkers can achieve significant stratification of patients' overall survival ($P < 0.05$) and outperform the original TMB signature in risk stratification of patients with advanced disease. The results indicate the feasibility of mining TMB-related information from WSI to achieve stepwise prognosis prediction.

ICML Conference 2023 Conference Paper

Partially Observable Multi-agent RL with (Quasi-)Efficiency: The Blessing of Information Sharing

  • Xiangyu Liu
  • Kaiqing Zhang

We study provable multi-agent reinforcement learning (MARL) in the general framework of partially observable stochastic games (POSGs). To circumvent the known hardness results and the use of computationally intractable oracles, we propose to leverage the potential information-sharing among agents, a standard practice in empirical MARL and a common model for multi-agent control systems with communications. We first establish several computation complexity results to justify the necessity of information-sharing, as well as the observability assumption that has enabled quasi-efficient single-agent RL with partial observations, for computational efficiency in solving POSGs. We then propose to further approximate the shared common information to construct an approximate model of the POSG, in which planning an approximate equilibrium (in terms of solving the original POSG) can be quasi-efficient, i.e., of quasi-polynomial time, under the aforementioned assumptions. Furthermore, we develop a partially observable MARL algorithm that is both statistically and computationally quasi-efficient. We hope our study can open up the possibilities of leveraging and even designing different information structures, for developing both sample- and computation-efficient partially observable MARL.

JBHI Journal 2022 Journal Article

Cancelable HD-SEMG Biometric Identification via Deep Feature Learning

  • Jiahao Fan
  • Xinyu Jiang
  • Xiangyu Liu
  • Xian Zhao
  • Xinming Ye
  • Chenyun Dai
  • Metin Akay
  • Wei Chen

Conventional biometric modalities, such as the face, fingerprint, and iris, are vulnerable against imitation and circumvention. Accordingly, secure biometric modalities with cancelable properties are needed for personal identification, especially in smart healthcare applications. Here we developed a person identification model using high-density surface electromyography (HD-sEMG) as biometric traits. In this model, the HD-sEMG biometric templates are cancelable and could be customized by the users through finger isometric contractions. A deep feature learning approach, implemented by convolutional neural networks (CNNs), is used to capture user-specific patterns from HD-sEMG signals and make identification decisions. This model has been validated on twenty-two subjects, with training and testing data acquired from two different days. The rank-1 identification accuracy and equal error rate for 44 identities (22 subjects × 2 accounts) can reach 87.23% and 4.66%, respectively. The cross-day identification accuracy of the proposed model is higher than the results of previous methods reported in the literature. The usability and efficiency of the proposed model are also investigated, indicating its potential for practical applications.

TCS Journal 2022 Journal Article

Tightly CCA-secure inner product functional encryption scheme

  • Xiangyu Liu
  • Shengli Liu
  • Shuai Han
  • Dawu Gu

Inner product functional encryption (IPFE) is a modern public key paradigm where the master key can derive a secret key $sk_{\mathbf{y}}$ for a vector $\mathbf{y}$, which can then be used to decrypt a ciphertext of $\mathbf{x}$ to get the inner product $\langle \mathbf{x}, \mathbf{y} \rangle$ as output. In ASIACRYPT 2019, Tomida proposed the first tightly secure IPFE scheme in the multi-user and multi-challenge setting based on the matrix decisional Diffie-Hellman (MDDH) assumption. However, the construction achieves CPA security only. Up to now, there is no IPFE scheme with tight CCA security available. In this paper, we construct the first tightly CCA-secure IPFE scheme in the multi-user and multi-challenge setting. The security reduction to the MDDH assumption (including SXDH, $k$-LIN, etc.) loses only a factor $O(\log \lambda)$ with $\lambda$ the security parameter. Moreover, our scheme enjoys full compactness. To support inner product functions of dimension $m$, our SXDH-based IPFE has $(m^2 + 8m + 14)$ and $(3m + 14)$ group elements in the master public key and ciphertext, respectively. This is comparable to the tightly CPA-secure IPFE proposed by Tomida based on the DDH assumption, whose master public key and ciphertext contain $(m^2 + 2)$ and $3m$ group elements, respectively. Furthermore, we construct the first IPFE with both tight CCA security and the function-hiding property, based on our CCA-secure IPFE. The tight function-hiding CCA security is obtained by adapting the techniques in Lin (CRYPTO 2017) and Gay (PKC 2020) to the multi-user and multi-challenge setting.
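The IPFE functionality described above amounts to the standard correctness requirement: for keys and ciphertexts produced honestly under the same master key,

```latex
\mathsf{Dec}\big(sk_{\mathbf{y}},\; \mathsf{Enc}(mpk, \mathbf{x})\big) \;=\; \langle \mathbf{x}, \mathbf{y} \rangle \;=\; \sum_{i=1}^{m} x_i\, y_i ,
```

so the decryptor learns only this single inner product about $\mathbf{x}$, not the vector itself.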

JBHI Journal 2021 Journal Article

Cancelable HD-sEMG-Based Biometrics for Cross-Application Discrepant Personal Identification

  • Xinyu Jiang
  • Ke Xu
  • Xiangyu Liu
  • Chenyun Dai
  • David A. Clifton
  • Edward A. Clancy
  • Metin Akay
  • Wei Chen

With the soaring development of body sensor network (BSN)-based health informatics, information security in such medical devices has attracted increasing attention in recent years. Employing the biosignals acquired directly by the BSN as biometrics for personal identification is an effective approach. Noncancelability and cross-application invariance are two natural flaws of most traditional biometric modalities. Once the biometric template is exposed, it is compromised forever. Even worse, because the same biometrics may be employed as tokens for different accounts in multiple applications, the exposed template can be used to compromise other accounts. In this work, we propose a cancelable and cross-application discrepant biometric approach based on high-density surface electromyogram (HD-sEMG) for personal identification. We enrolled two accounts for each user. HD-sEMG signals from the right dorsal hand under isometric contractions of different finger muscles were employed as biometric tokens. Since isometric contraction, in contrast to dynamic contraction, requires no actual movement, the users’ choice to login to different accounts is greatly protected against impostors. We realized a promising identification accuracy of 85.8% for 44 identities (22 subjects × 2 accounts) with training and testing data acquired 9 days apart. The high identification accuracy of different accounts for the same user demonstrates the promising cancelability and cross-application discrepancy of the proposed HD-sEMG-based biometrics. To the best of our knowledge, this is the first study to employ HD-sEMG in personal identification applications, with signal variation across days considered.

IROS Conference 2021 Conference Paper

Coxgraph: Multi-Robot Collaborative, Globally Consistent, Online Dense Reconstruction System

  • Xiangyu Liu
  • Weicai Ye
  • Chaoran Tian
  • Zhaopeng Cui
  • Hujun Bao
  • Guofeng Zhang 0001

Real-time dense reconstruction has been extensively studied for its wide applications in computer vision and robotics, meanwhile much effort has been made for the multi-robot system which plays an irreplaceable role in complicated but time-critical scenarios, e.g., search and rescue tasks. In this paper, we propose an efficient system named Coxgraph for multi-robot collaborative dense reconstruction in real-time. In our system, each client performs volumetric mapping in a producer-consumer manner. To facilitate transmission, we propose a compact 3D representation which transforms the SDF submap to mesh packs. During the recovery of submaps from mesh packs, the system can perform loop closure outlier rejection based on geometry consistency, trajectory collision and fitness check. Then we develop a robust map fusion method through joint optimization of trajectories and submaps. Extensive experiments demonstrate that our system can produce a globally consistent dense map in real-time with less transmission load, and it is available as open-source software.

NeurIPS Conference 2021 Conference Paper

Towards Unifying Behavioral and Response Diversity for Open-ended Learning in Zero-sum Games

  • Xiangyu Liu
  • Hangtian Jia
  • Ying Wen
  • Yujing Hu
  • Yingfeng Chen
  • Changjie Fan
  • Zhipeng Hu
  • Yaodong Yang

Measuring and promoting policy diversity is critical for solving games with strong non-transitive dynamics where strategic cycles exist, and there is no consistent winner (e.g., Rock-Paper-Scissors). With that in mind, maintaining a pool of diverse policies via open-ended learning is an attractive solution, which can generate auto-curricula to avoid being exploited. However, in conventional open-ended learning algorithms, there are no widely accepted definitions for diversity, making it hard to construct and evaluate the diverse policies. In this work, we summarize previous concepts of diversity and work towards offering a unified measure of diversity in multi-agent open-ended learning to include all elements in Markov games, based on both Behavioral Diversity (BD) and Response Diversity (RD). At the trajectory distribution level, we re-define BD in the state-action space as the discrepancies of occupancy measures. For the reward dynamics, we propose RD to characterize diversity through the responses of policies when encountering different opponents. We also show that many current diversity measures fall in one of the categories of BD or RD but not both. With this unified diversity measure, we design the corresponding diversity-promoting objective and population effectivity when seeking the best responses in open-ended learning. We validate our methods in both relatively simple games like matrix game, non-transitive mixture model, and the complex \textit{Google Research Football} environment. The population found by our methods reveals the lowest exploitability, highest population effectivity in matrix game and non-transitive mixture model, as well as the largest goal difference when interacting with opponents of various levels in \textit{Google Research Football}.