Arrow Research search

Author name cluster

Chao Yu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

40 papers
2 author rows

Possible papers

40

AAAI Conference 2026 Conference Paper

CATAL: Causally Disentangled Task Representation Learning for Offline Meta-Reinforcement Learning

  • Shan Cong
  • Chao Yu
  • Xiangyuan Lan

Context-based Offline Meta Reinforcement Learning (COMRL) has shown promising results in improving the cross-task generalization ability of meta-policies. However, current methods often lead to entangled task representations, in which each latent dimension is influenced by multiple causal factors that govern variations in environment dynamics and reward mechanisms. This entanglement can degrade generalization performance, particularly when multiple causal factors vary simultaneously across tasks. To address this limitation, we propose the CAusally disentangled TAsk representation Learning (CATAL) method for COMRL, which aims to improve the generalization ability of the meta-policy by aligning each latent dimension of the task representations with a single causal factor. Theoretically, we show that under mild conditions, the task representations learned by CATAL are causally disentangled. Empirically, extensive results on multi-task MuJoCo benchmarks show that CATAL consistently outperforms existing COMRL baselines in both in-distribution and out-of-distribution generalization.

AAAI Conference 2026 Conference Paper

Reliability-Guaranteed and Reward-Seeking Sequence Modeling for Model-Based Offline Reinforcement Learning

  • Shenghong He
  • Chao Yu
  • Qian Lin
  • Yile Liang
  • Donghui Li
  • Xuetao Ding

As a data-driven learning approach, model-based offline reinforcement learning (MORL) aims to learn a policy by exploiting a dynamics model derived from an existing dataset. By applying conservative quantification to the dynamics model, most existing works on MORL generate trajectories that approximate the real data distribution to facilitate policy learning. However, these methods typically overlook the influence of historical information on environmental dynamics, thus generating unreliable trajectories that fail to align with the true data distribution. In this paper, we propose a new MORL algorithm called Reliability-guaranteed and Reward-seeking Transformer (RT). RT avoids generating unreliable trajectories by computing the cumulative reliability of trajectories, a weighted variational distance between the generated trajectory distribution and the true data distribution. Moreover, by sampling candidate actions with high rewards, RT can efficiently generate high-reward trajectories from the existing offline data, thereby further facilitating policy learning. We theoretically prove the performance guarantees of RT in policy learning, and empirically demonstrate its effectiveness against state-of-the-art model-based methods on several offline benchmark tasks and a large-scale industrial dataset from an on-demand food delivery platform.

ICRA Conference 2025 Conference Paper

Human-Robot Cooperative Distribution Coupling for Hamiltonian-Constrained Social Navigation

  • Weizheng Wang 0004
  • Chao Yu
  • Yu Wang
  • Byung-Cheol Min

Navigating in human-filled public spaces is a critical challenge for deploying autonomous robots in real-world environments. This paper introduces NaviDIFF, a novel Hamiltonian-constrained socially-aware navigation framework designed to address the complexities of human-robot interaction and socially-aware path planning. NaviDIFF integrates a port-Hamiltonian framework to model dynamic physical interactions and a diffusion model to manage uncertainty in human-robot cooperation. The framework leverages a spatial-temporal transformer to capture social and temporal dependencies, enabling more accurate understanding of spatial-temporal environmental dynamics and construction of the port-Hamiltonian physical interaction process. Additionally, reinforcement learning from human feedback is employed to fine-tune robot policies, ensuring adaptation to human preferences and social norms. Extensive experiments demonstrate that NaviDIFF outperforms state-of-the-art methods in social navigation tasks, offering improved stability, efficiency, and adaptability. The experimental videos and additional information about this work can be found at: https://sites.google.com/view/NaviDIFF.

JBHI Journal 2025 Journal Article

Inducing Long-Term Plastic Changes and Visual Attention Enhancement Via One-Week Cerebellar Crus II Intermittent Theta Burst Stimulation (iTBS): An EEG Study

  • Meiliang Liu
  • Chao Yu
  • Minjie Tian
  • Jingping Shi
  • Yunfang Xu
  • Zijin Li
  • Zhengye Si
  • Xiaoxiao Yang

Intermittent theta burst stimulation (iTBS) is a non-invasive technique frequently employed to induce neural plastic changes and enhance visual attention. Currently, most studies utilize a single iTBS session on healthy subjects to induce short-term neural plastic changes within tens of minutes post-stimulation and investigate its single-session effect on attention performance. Few studies have conducted multiple iTBS sessions on the cerebellum to explore long-term effects on the cerebral cortex and day-to-day effects on visual attention performance. In this study, 18 healthy subjects were involved in a randomized, sham-controlled experiment over one week. All subjects received a daily session of bilateral cerebellar Crus II iTBS or sham stimulation and completed a visual search task. Resting-state electroencephalogram (EEG) data were collected 48 hours pre- and post-experiment to assess plastic changes induced by iTBS. The results indicated that the iTBS group exhibited higher accuracy and lower time costs than the sham group after three sessions of iTBS. In addition, iTBS-induced plastic changes persisted up to 48 hours post-experiment, including a left-shifted individual alpha frequency, increased intrinsic excitability (the likelihood that a neuron will generate an output in response to a given input), and enhanced PLV functional connectivity (phase synchronization between different brain regions). Furthermore, we found that cerebellar iTBS induced a remote effect on the frontal region. Our study revealed the capacity of cerebellar Crus II iTBS to induce plastic changes and enhance attention performance, providing a potential avenue for using iTBS to promote rehabilitation.

JMLR Journal 2025 Journal Article

Learning Global Nash Equilibrium in Team Competitive Games with Generalized Fictitious Cross-Play

  • Zelai Xu
  • Chao Yu
  • Yancheng Liang
  • Yi Wu
  • Yu Wang

Self-play (SP) is a popular multi-agent reinforcement learning framework for competitive games. Despite the empirical success, the theoretical properties of SP are limited to two-player settings. For team competitive games where two teams of cooperative agents compete with each other, we show a counter-example where SP cannot converge to a global Nash equilibrium (NE) with high probability. Policy-Space Response Oracles (PSRO) is an alternative framework that finds NEs by iteratively learning the best response (BR) to previous policies. PSRO can be directly extended to team competitive games with unchanged convergence properties by learning team BRs, but its repeated training from scratch makes it hard to scale to complex games. In this work, we propose Generalized Fictitious Cross-Play (GFXP), a novel algorithm that inherits benefits from both frameworks. GFXP simultaneously trains an SP-based main policy and a counter population. The main policy is trained by fictitious self-play and cross-play against the counter population, while the counter policies are trained as the BRs to the main policy's checkpoints. We evaluate GFXP in matrix games and gridworld domains where GFXP achieves the lowest exploitabilities. We further conduct experiments in a challenging football game where GFXP defeats SOTA models with over 94% win rate.

AAAI Conference 2025 Conference Paper

Offline Multi-Agent Reinforcement Learning via In-Sample Sequential Policy Optimization

  • Zongkai Liu
  • Qian Lin
  • Chao Yu
  • Xiawei Wu
  • Yile Liang
  • Donghui Li
  • Xuetao Ding

Offline Multi-Agent Reinforcement Learning (MARL) is an emerging field that aims to learn optimal multi-agent policies from pre-collected datasets. Compared to the single-agent case, the multi-agent setting involves a large joint state-action space and coupled behaviors of multiple agents, which bring extra complexity to offline policy optimization. In this work, we revisit the existing offline MARL methods and show that in certain scenarios they can be problematic, leading to uncoordinated behaviors and out-of-distribution (OOD) joint actions. To address these issues, we propose a new offline MARL algorithm, named In-Sample Sequential Policy Optimization (InSPO). InSPO sequentially updates each agent's policy in an in-sample manner, which not only avoids selecting OOD joint actions but also carefully considers teammates' updated policies to enhance coordination. Additionally, by thoroughly exploring low-probability actions in the behavior policy, InSPO can well address the issue of premature convergence to sub-optimal solutions. Theoretically, we prove that InSPO guarantees monotonic policy improvement and converges to a quantal response equilibrium (QRE). Experimental results demonstrate the effectiveness of our method compared to current state-of-the-art offline MARL methods.

AAAI Conference 2025 Conference Paper

Rapid Learning in Constrained Minimax Games with Negative Momentum

  • Zijian Fang
  • Zongkai Liu
  • Chao Yu
  • Chaohao Hu

In this paper, we delve into the utilization of the negative momentum technique in constrained minimax games. From an intuitive mechanical standpoint, we introduce a novel framework for momentum buffer updating, which extends the findings on negative momentum from the unconstrained setting to the constrained setting and provides a universal enhancement to classic game-solving algorithms. Additionally, we provide theoretical guarantees of convergence for our momentum-augmented learning algorithms. We then extend these algorithms to their extensive-form counterparts. Experimental results on both Normal Form Games (NFGs) and Extensive Form Games (EFGs) demonstrate that our momentum techniques can significantly improve algorithm performance, surpassing both their original versions and the SOTA baselines by a large margin.
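
To make the core idea concrete, here is a minimal, hedged sketch of projected gradient descent-ascent with a negative momentum buffer on a small constrained (simplex) matrix game; the payoff matrix, step size, and momentum coefficient are illustrative choices, not the paper's exact update rule:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

A = np.array([[0.0, -1.0, 1.0],
              [1.0, 0.0, -1.0],
              [-1.0, 1.0, 0.0]])      # rock-paper-scissors payoffs for x
x, y = np.ones(3) / 3, np.ones(3) / 3
bx, by = np.zeros(3), np.zeros(3)     # momentum buffers
lr, beta = 0.1, -0.3                  # beta < 0: negative momentum

for _ in range(2000):                 # alternating updates
    bx = beta * bx + A @ y            # x minimizes x^T A y
    x = project_simplex(x - lr * bx)
    by = beta * by + A.T @ x          # y maximizes x^T A y
    y = project_simplex(y + lr * by)

print(x, y)  # iterates should hover near the uniform equilibrium
```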

NeurIPS Conference 2025 Conference Paper

ReinFlow: Fine-tuning Flow Matching Policy with Online Reinforcement Learning

  • Tonghe Zhang
  • Chao Yu
  • Sichang Su
  • Yu Wang

We propose ReinFlow, a simple yet effective online reinforcement learning (RL) framework that fine-tunes a family of flow matching policies for continuous robotic control. Derived from rigorous RL theory, ReinFlow injects learnable noise into a flow policy's deterministic path, converting the flow into a discrete-time Markov process for exact and straightforward likelihood computation. This conversion facilitates exploration and ensures training stability, enabling ReinFlow to stably fine-tune diverse flow model variants, including Rectified Flow [34] and Shortcut Models [18], particularly at very few or even one denoising step. We benchmark ReinFlow in representative locomotion and manipulation tasks, including long-horizon planning with visual input and sparse reward. The episode reward of Rectified Flow policies grew by an average of 135.36% after fine-tuning in challenging legged locomotion tasks, while saving denoising steps and 82.63% of wall time compared to the state-of-the-art diffusion RL fine-tuning method DPPO [42]. The success rate of Shortcut Model policies in state and visual manipulation tasks increased by an average of 40.34% after fine-tuning with ReinFlow at four or even one denoising step, with performance comparable to fine-tuned DDIM policies while saving an average of 23.20% of computation time. Code, model, and checkpoints are available on the project website: https://reinflow.github.io/
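
As a rough illustration of that conversion, the sketch below adds learnable Gaussian noise to each Euler step of a flow policy so the denoising chain becomes a discrete-time Markov process with an exact log-likelihood; the network interfaces and shapes are assumptions, not ReinFlow's actual API:

```python
import torch

def noisy_flow_rollout(velocity_net, sigma_net, obs, act_dim, steps=4):
    """Sample an action and its exact log-probability from a noise-injected flow."""
    batch = obs.shape[0]
    a = torch.randn(batch, act_dim)                 # a_0 ~ N(0, I)
    log_prob = torch.zeros(batch)
    dt = 1.0 / steps
    for k in range(steps):
        t = torch.full((batch,), k * dt)
        mean = a + velocity_net(obs, a, t) * dt     # deterministic Euler step
        std = sigma_net(obs, a, t)                  # learnable noise scale (assumed positive)
        a = mean + std * torch.randn_like(mean)     # stochastic Markov transition
        log_prob = log_prob + torch.distributions.Normal(mean, std).log_prob(a).sum(-1)
    return a, log_prob                              # log_prob feeds a PPO-style ratio
```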

NeurIPS Conference 2025 Conference Paper

VolleyBots: A Testbed for Multi-Drone Volleyball Game Combining Motion Control and Strategic Play

  • Zelai Xu
  • Ruize Zhang
  • Chao Yu
  • Huining Yuan
  • Xiangmin Yi
  • Shilong Ji
  • Chuqi Wang
  • Wenhao Tang

Robot sports, characterized by well-defined objectives, explicit rules, and dynamic interactions, present ideal scenarios for demonstrating embodied intelligence. In this paper, we present VolleyBots, a novel robot sports testbed where multiple drones cooperate and compete in the sport of volleyball under physical dynamics. VolleyBots integrates three features within a unified platform: competitive and cooperative gameplay, turn-based interaction structure, and agile 3D maneuvering. These intertwined features yield a complex problem combining motion control and strategic play, with no available expert demonstrations. We provide a comprehensive suite of tasks ranging from single-drone drills to multi-drone cooperative and competitive tasks, accompanied by baseline evaluations of representative reinforcement learning (RL), multi-agent reinforcement learning (MARL) and game-theoretic algorithms. Simulation results show that on-policy RL methods outperform off-policy methods in single-agent tasks, but both approaches struggle in complex tasks that combine motion control and strategic play. We additionally design a hierarchical policy which achieves a 69.5% win rate against the strongest baseline in the 3 vs 3 task, demonstrating its potential for tackling the complex interplay between low-level control and high-level strategy. To highlight VolleyBots' sim-to-real potential, we further demonstrate the zero-shot deployment of a policy trained entirely in simulation on real-world drones.

NeurIPS Conference 2025 Conference Paper

What Can RL Bring to VLA Generalization? An Empirical Study

  • Jijia Liu
  • Feng Gao
  • Bingwen Wei
  • Xinlei Chen
  • Qingmin Liao
  • Yi Wu
  • Chao Yu
  • Yu Wang

Large Vision-Language Action (VLA) models have shown significant potential for embodied AI. However, their predominant training via supervised fine-tuning (SFT) limits generalization due to susceptibility to compounding errors under distribution shifts. Reinforcement learning (RL) offers a path to overcome these limitations by optimizing for task objectives via trial-and-error, yet a systematic understanding of its specific generalization benefits for VLAs compared to SFT is lacking. To address this, our study introduces a comprehensive benchmark for evaluating VLA generalization and systematically investigates the impact of RL fine-tuning across diverse visual, semantic, and execution dimensions. Our extensive experiments reveal that RL fine-tuning, particularly with PPO, significantly enhances generalization in semantic understanding and execution robustness over SFT, while maintaining comparable visual robustness. We identify PPO as a more effective RL algorithm for VLAs than LLM-derived methods like DPO and GRPO. We also develop a simple recipe for efficient PPO training on VLAs, and demonstrate its practical utility for improving VLA generalization. The project page is at https://rlvla.github.io

AAAI Conference 2024 Conference Paper

Accelerate Multi-Agent Reinforcement Learning in Zero-Sum Games with Subgame Curriculum Learning

  • Jiayu Chen
  • Zelai Xu
  • Yunfei Li
  • Chao Yu
  • Jiaming Song
  • Huazhong Yang
  • Fei Fang
  • Yu Wang

Learning Nash equilibrium (NE) in complex zero-sum games with multi-agent reinforcement learning (MARL) can be extremely computationally expensive. Curriculum learning is an effective way to accelerate learning, but an under-explored dimension for generating a curriculum is the difficulty-to-learn of the subgames, i.e., games induced by starting from a specific state. In this work, we present a novel subgame curriculum learning framework for zero-sum games. It adopts an adaptive initial state distribution by resetting agents to previously visited states where they can quickly learn to improve performance. Building upon this framework, we derive a subgame selection metric that approximates the squared distance to NE values and further adopt a particle-based state sampler for subgame generation, as sketched below. Integrating these techniques leads to our new algorithm, Subgame Automatic Curriculum Learning (SACL), which realizes the subgame curriculum learning framework. SACL can be combined with any MARL algorithm such as MAPPO. Experiments in the particle-world environment and the Google Research Football environment show that SACL produces much stronger policies than baselines. In the challenging hide-and-seek quadrant environment, SACL produces all four emergent stages and uses only half the samples of MAPPO with self-play. The project website is at https://sites.google.com/view/sacl-neurips.
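
A hedged sketch of what such a particle-based initial-state sampler might look like: a fixed-capacity buffer of visited states, resampled with probability proportional to a learning-progress score that stands in for the paper's squared-distance-to-NE metric (names, eviction rule, and nonnegative scores are assumptions):

```python
import random

class SubgameSampler:
    """Fixed-capacity pool of visited states with score-proportional sampling."""
    def __init__(self, capacity=10000):
        self.states, self.scores, self.capacity = [], [], capacity

    def add(self, state, score):
        if len(self.states) >= self.capacity:   # evict the lowest-scoring particle
            i = min(range(len(self.scores)), key=self.scores.__getitem__)
            self.states.pop(i)
            self.scores.pop(i)
        self.states.append(state)
        self.scores.append(score)

    def sample_reset_state(self):
        return random.choices(self.states, weights=self.scores, k=1)[0]
```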

NeurIPS Conference 2024 Conference Paper

An Offline Adaptation Framework for Constrained Multi-Objective Reinforcement Learning

  • Qian Lin
  • Zongkai Liu
  • Danying Mo
  • Chao Yu

In recent years, significant progress has been made in multi-objective reinforcement learning (RL) research, which aims to balance multiple objectives by incorporating preferences for each objective. In most existing studies, specific preferences must be provided during deployment to indicate the desired policies explicitly. However, designing these preferences depends heavily on human prior knowledge, which is typically obtained through extensive observation of high-performing demonstrations with expected behaviors. In this work, we propose a simple yet effective offline adaptation framework for multi-objective RL problems without assuming handcrafted target preferences, but only given several demonstrations to implicitly indicate the preferences of expected policies. Additionally, we demonstrate that our framework can naturally be extended to meet constraints on safety-critical objectives by utilizing safe demonstrations, even when the safety thresholds are unknown. Empirical results on offline multi-objective and safe tasks demonstrate the capability of our framework to infer policies that align with real preferences while meeting the constraints implied by the provided demonstrations.

AAMAS Conference 2024 Conference Paper

LLM-Powered Hierarchical Language Agent for Real-time Human-AI Coordination

  • Jijia Liu
  • Chao Yu
  • Jiaxuan Gao
  • Yuqing Xie
  • Qingmin Liao
  • Yi Wu
  • Yu Wang

AI agents powered by Large Language Models (LLMs) have made significant advances, enabling them to assist humans in diverse complex tasks and leading to a revolution in human-AI coordination. LLM-powered agents typically require invoking LLM APIs and employing artificially designed complex prompts, which results in high inference latency. While this paradigm works well in scenarios with minimal interactive demands, such as code generation, it is unsuitable for highly interactive and real-time applications, such as gaming. Traditional gaming AI often employs small models or reactive policies, enabling fast inference but offering limited task completion and interaction abilities. In this work, we consider Overcooked as our testbed, where players can communicate in natural language and cooperate to serve orders. We propose a Hierarchical Language Agent (HLA) for human-AI coordination that provides strong reasoning abilities while maintaining real-time execution. In particular, HLA adopts a hierarchical framework and comprises three modules: a proficient LLM, referred to as Slow Mind, for intention reasoning and language interaction; a lightweight LLM, referred to as Fast Mind, for generating macro actions; and a reactive policy, referred to as Executor, for transforming macro actions into atomic actions. Human studies show that HLA outperforms other baseline agents, including slow-mind-only agents and fast-mind-only agents, with stronger cooperation abilities, faster responses, and more consistent language communication.

AAMAS Conference 2024 Conference Paper

Policy-regularized Offline Multi-objective Reinforcement Learning

  • Qian Lin
  • Chao Yu
  • Zongkai Liu
  • Zifan Wu

In this paper, we aim to utilize only offline trajectory data to train a policy for multi-objective RL. We extend the offline policy-regularized method, a widely adopted approach for single-objective offline RL problems, into the multi-objective setting in order to achieve the above goal. However, such methods face a new challenge in offline MORL settings, namely the preference-inconsistent demonstration problem. We propose two solutions to this problem: 1) filtering out preference-inconsistent demonstrations via approximating behavior preferences, and 2) adopting regularization techniques with high policy expressiveness. Moreover, we integrate the preference-conditioned scalarized update method into policy-regularized offline RL, in order to simultaneously learn a set of policies using a single policy network, thus reducing the computational costs induced by training a large number of individual policies for various preferences. Finally, we introduce Regularization Weight Adaptation to dynamically determine appropriate regularization weights for arbitrary target preferences during deployment. Empirical results on various multi-objective datasets demonstrate the capability of our approach in solving offline MORL problems.
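
For intuition, a minimal sketch of a preference-conditioned, policy-regularized actor loss: the vector critic is scalarized by the preference weights and a behavior-cloning term keeps the policy near the data, in the spirit of TD3+BC; the fixed trade-off weight here is exactly what the paper's Regularization Weight Adaptation would replace:

```python
import torch

def scalarized_actor_loss(q_values, preference, policy_action, data_action,
                          reg_weight=2.5):
    """q_values: [B, m] per-objective critic values; preference: [B, m] weights."""
    scalar_q = (preference * q_values).sum(-1)          # preference-scalarized objective
    bc = ((policy_action - data_action) ** 2).sum(-1)   # policy-regularization term
    lam = reg_weight / scalar_q.abs().mean().detach()   # scale-invariant trade-off
    return (-lam * scalar_q + bc).mean()
```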

IROS Conference 2024 Conference Paper

RCAL: A Lightweight Road Cognition and Automated Labeling System for Autonomous Driving Scenarios

  • Jiancheng Chen
  • Chao Yu
  • Huayou Wang
  • Kun Liu
  • Yifei Zhan
  • Xianpeng Lang
  • Changliang Xue

Vectorized reconstruction and topological cognition of road structures are crucial for autonomous vehicles to handle complex scenes. Traditional frameworks rely heavily on high-definition (HD) maps, which place significant demands on storage, computation, and manual labor. To overcome these limitations, we introduce a lightweight Road Cognition and Automated Labeling (RCAL) system. It leverages lightweight road data captured from mass-produced vehicles to vectorize road elements and cognize their topology. RCAL compiles multi-trip data on cloud servers for enhanced accuracy and coverage, addressing the limitations of single-trip data. For element extraction, we propose a pivotal point priority sampling strategy that balances the trade-off between road scale and processing efficiency. Additionally, traffic flow is utilized to enhance the accuracy of road topology cognition. With its impressive automation, reliability, and efficiency, RCAL stands as an advanced solution in the field. Our evaluations on a real-world intersection dataset confirm that RCAL not only achieves comparable precision to traditional HD map labeling systems but also substantially reduces resource costs.

TMLR Journal 2024 Journal Article

Revisiting Discrete Soft Actor-Critic

  • Haibin Zhou
  • Tong Wei
  • Zichuan Lin
  • Junyou Li
  • Junliang Xing
  • Yuanchun Shi
  • Li Shen
  • Chao Yu

We study the adaptation of Soft Actor-Critic (SAC), a state-of-the-art reinforcement learning (RL) algorithm, from continuous action spaces to discrete action spaces. We revisit vanilla discrete SAC and provide an in-depth understanding of its Q-value underestimation and performance instability issues when applied to discrete settings. We thereby propose Stable Discrete SAC (SDSAC), an algorithm that leverages an entropy penalty and double average Q-learning with Q-clip to address these issues. Extensive experiments on typical benchmarks with discrete action spaces, including Atari games and a large-scale MOBA game, show the efficacy of our proposed method. Our code is at: https://github.com/coldsummerday/SD-SAC.git.
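
A hedged sketch of the target computation that abstract describes, with the two target critics averaged instead of min-ed and the bootstrapped target optionally clipped; tensor shapes and the clip bound are assumptions:

```python
import torch

def discrete_sac_target(q1_t, q2_t, log_pi, reward, done,
                        gamma=0.99, alpha=0.2, clip_bound=None):
    """q1_t, q2_t: [B, A] target-critic values; log_pi: [B, A] next-state log-policy."""
    pi = log_pi.exp()
    q_avg = 0.5 * (q1_t + q2_t)                       # average instead of minimum
    v_next = (pi * (q_avg - alpha * log_pi)).sum(-1)  # exact soft state value over actions
    target = reward + gamma * (1.0 - done) * v_next
    if clip_bound is not None:                        # "Q-clip": bound the target
        target = target.clamp(-clip_bound, clip_bound)
    return target.detach()
```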

AAMAS Conference 2023 Conference Paper

Asynchronous Multi-Agent Reinforcement Learning for Efficient Real-Time Multi-Robot Cooperative Exploration

  • Chao Yu
  • Xinyi Yang
  • Jiaxuan Gao
  • Jiayu Chen
  • Yunfei Li
  • Jijia Liu
  • Yunfei Xiang
  • Ruixin Huang

We consider the problem of cooperative exploration where multiple robots need to cooperatively explore an unknown region as fast as possible. Multi-agent reinforcement learning (MARL) has recently become a trending paradigm for solving this challenge. However, existing MARL-based methods adopt action-making steps as the metric for exploration efficiency by assuming all the agents are acting in a fully synchronous manner: i.e., every single agent produces an action simultaneously and every single action is executed instantaneously at each time step. Despite its mathematical simplicity, such a synchronous MARL formulation can be problematic for real-world robotic applications. In practice, different robots may take slightly different wall-clock times to accomplish an atomic action or even periodically get lost due to hardware issues. Simply waiting for every robot to be ready for the next action can be particularly time-inefficient. Therefore, we propose an asynchronous MARL solution, Asynchronous Coordination Explorer (ACE), to tackle this real-world challenge. We first extend a classical MARL algorithm, multi-agent PPO (MAPPO), to the asynchronous setting and additionally apply action-delay randomization so that the learned policy generalizes better to varying action delays in the real world. Moreover, each navigation agent is represented as a team-size-invariant CNN-based policy, which greatly benefits real-robot deployment by handling possible robot loss and allows bandwidth-efficient intra-agent communication through low-dimensional CNN features. We first validate our approach in a grid-based scenario. Both simulation and real-robot results show that ACE reduces actual exploration time by over 10% compared with classical approaches. We also apply our framework to a high-fidelity visual-based environment, Habitat, achieving a 28% improvement in exploration efficiency.
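
A sketch of the action-delay randomization idea under assumed interfaces: each issued atomic action takes a random number of environment ticks to finish, so the policy is trained to tolerate the varying wall-clock delays of real robots:

```python
import random

class ActionDelayWrapper:
    """Delays each agent's action by a random number of ticks (assumed env API)."""
    def __init__(self, env, max_delay=3):
        self.env, self.max_delay = env, max_delay
        self.pending = {}                              # agent id -> [action, ticks left]

    def step(self, actions):
        for agent, act in actions.items():             # ignore new commands while busy
            self.pending.setdefault(agent, [act, random.randint(0, self.max_delay)])
        ready = {}
        for agent in list(self.pending):
            act, ticks = self.pending[agent]
            if ticks == 0:
                ready[agent] = act                     # this agent's action fires now
                del self.pending[agent]
            else:
                self.pending[agent][1] = ticks - 1
        return self.env.step(ready)                    # only finished agents act this tick
```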

IJCAI Conference 2023 Conference Paper

Automatic Truss Design with Reinforcement Learning

  • Weihua Du
  • Jinglun Zhao
  • Chao Yu
  • Xingcheng Yao
  • Zimeng Song
  • Siyang Wu
  • Ruifeng Luo
  • Zhiyuan Liu

Truss layout design, namely finding a lightweight truss layout satisfying all the physical constraints, is a fundamental problem in the building industry. Generating the optimal layout is a challenging combinatorial optimization problem, which can be extremely expensive to solve by exhaustive search. Directly applying end-to-end reinforcement learning (RL) methods to truss layout design is also infeasible, since only a tiny portion of the entire layout space is valid under the physical constraints, leading to particularly sparse rewards for RL training. In this paper, we develop AutoTruss, a two-stage framework to efficiently generate both lightweight and valid truss layouts. AutoTruss first adopts Monte Carlo tree search to discover a diverse collection of valid layouts. Then RL is applied to iteratively refine the valid solutions. We conduct experiments and ablation studies on popular truss layout design test cases in both 2D and 3D settings. AutoTruss outperforms the best-reported layouts by 25.1% in the most challenging 3D test cases, resulting in the first effective deep-RL-based approach in the truss layout design literature.

IJCAI Conference 2023 Conference Paper

Causal Deep Reinforcement Learning Using Observational Data

  • Wenxuan Zhu
  • Chao Yu
  • Qiang Zhang

Deep reinforcement learning (DRL) requires the collection of interventional data, which is sometimes expensive and even unethical in the real world, such as in autonomous driving and the medical field. Offline reinforcement learning promises to alleviate this issue by exploiting the vast amount of observational data available in the real world. However, observational data may mislead the learning agent to undesirable outcomes if the behavior policy that generates the data depends on unobserved random variables (i.e., confounders). In this paper, we propose two deconfounding methods in DRL to address this problem. The methods first calculate the importance degree of different samples based on causal inference techniques, and then adjust the impact of different samples on the loss function by reweighting or resampling the offline dataset to ensure its unbiasedness. These deconfounding methods can be flexibly combined with existing model-free DRL algorithms such as soft actor-critic and deep Q-learning, provided that a weak condition is satisfied by the loss functions of these algorithms. We prove the effectiveness of our deconfounding methods and validate them experimentally.
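
A minimal sketch of the reweighting variant, assuming a separately estimated propensity model for the behavior policy (the paper derives its weights from a causal-inference technique; this is just the generic inverse-propensity form):

```python
import torch

def deconfounded_loss(per_sample_loss, propensity):
    """per_sample_loss: [B] RL losses; propensity: [B] estimated behavior probabilities."""
    w = 1.0 / propensity.clamp_min(1e-3)   # inverse-propensity weights (clipped for stability)
    w = w / w.mean()                        # renormalize to preserve the loss scale
    return (w * per_sample_loss).mean()
```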

AAMAS Conference 2023 Conference Paper

Fictitious Cross-Play: Learning Global Nash Equilibrium in Mixed Cooperative-Competitive Games

  • Zelai Xu
  • Yancheng Liang
  • Chao Yu
  • Yu Wang
  • Yi Wu

Self-play (SP) is a popular multi-agent reinforcement learning (MARL) framework for solving competitive games, where each agent optimizes policy by treating others as part of the environment. Despite the empirical successes, the theoretical properties of SP-based methods are limited to two-player zero-sum games. However, for mixed cooperative-competitive games where agents on the same team need to cooperate with each other, we can show a simple counterexample where SP-based methods cannot converge to a global Nash equilibrium (NE) with high probability. Alternatively, Policy-Space Response Oracles (PSRO) is an iterative framework for learning NE, where the best responses w.r.t. previous policies are learned in each iteration. PSRO can be directly extended to mixed cooperative-competitive settings by jointly learning team best responses with all convergence properties unchanged. However, PSRO requires repeatedly training joint policies from scratch till convergence, which makes it hard to scale to complex games. In this work, we develop a novel algorithm, Fictitious Cross-Play (FXP), which inherits the benefits from both frameworks. FXP simultaneously trains an SP-based main policy and a counter population of best response policies. The main policy is trained by fictitious self-play and cross-play against the counter population, while the counter policies are trained as the best responses to the main policy's past versions. We validate our method in matrix games and show that FXP converges to global NEs while SP methods fail. We also conduct experiments in a gridworld domain, where FXP achieves higher Elo ratings and lower exploitabilities than baselines, and a more challenging football game, where FXP defeats SOTA models with over 94% win rate.
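
At pseudocode level, one FXP iteration as the abstract describes it might look as follows; every name here is a placeholder, not the paper's implementation:

```python
import random

def fxp_iteration(main, main_checkpoints, counters, train_against):
    # Main policy: fictitious self-play against its own past versions plus
    # cross-play against the counter population.
    opponent = random.choice([main] + main_checkpoints + counters)
    train_against(main, opponent)
    # Counter policies: best responses to the main policy's past versions.
    if main_checkpoints:
        for counter in counters:
            train_against(counter, random.choice(main_checkpoints))
    main_checkpoints.append(main.snapshot())   # hypothetical checkpointing call
```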

AAAI Conference 2023 Conference Paper

Hierarchical Mean-Field Deep Reinforcement Learning for Large-Scale Multiagent Systems

  • Chao Yu

Learning for efficient coordination in large-scale multiagent systems suffers from the curse of dimensionality due to the exponential growth of agent interactions. Mean-Field (MF)-based methods address this issue by reducing the interactions within the whole system to those between a single agent and the average effect of its neighbors. However, representing the neighbors merely by their average may ignore the varying influence of each neighbor, and learning with this kind of local average effect is likely to yield inferior system performance due to the lack of an efficient coordination mechanism at the whole-population level. In this work, we propose a Hierarchical Mean-Field (HMF) learning framework to further improve the performance of existing MF methods. The basic idea is to approximate the average effect for a sub-group of agents by considering their different influences within the sub-group, and to realize population-level coordination through the interactions among different sub-groups. Empirical studies show that HMF significantly outperforms existing baselines on both challenging cooperative and mixed cooperative-competitive tasks with different scales of agent populations.
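
As a sketch of the underlying approximation, the helper below computes an influence-weighted mean action per sub-group (uniform weights recover the vanilla mean-field average); the weighting scheme and data layout are assumptions:

```python
import numpy as np

def subgroup_mean_actions(actions, groups, weights):
    """actions: [N, d] agent actions; groups: [N] sub-group ids; weights: [N] influences."""
    out = {}
    for g in np.unique(groups):
        mask = groups == g
        w = weights[mask] / weights[mask].sum()       # normalize within the sub-group
        out[g] = (w[:, None] * actions[mask]).sum(0)  # weighted mean action for group g
    return out
```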

NeurIPS Conference 2023 Conference Paper

Hybrid Policy Optimization from Imperfect Demonstrations

  • Hanlin Yang
  • Chao Yu
  • Peng Sun
  • Siji Chen

Exploration is one of the main challenges in Reinforcement Learning (RL), especially in environments with sparse rewards. Learning from Demonstrations (LfD) is a promising approach to solving this problem by leveraging expert demonstrations. However, expert demonstrations of high quality are usually costly or even impossible to collect in real-world applications. In this work, we propose a novel RL algorithm called HYbrid Policy Optimization (HYPO), which uses a small number of imperfect demonstrations to accelerate an agent's online learning process. The key idea is to train an offline guider policy using imitation learning in order to instruct an online agent policy to explore efficiently. Through mutual update of the guider policy and the agent policy, the agent can leverage suboptimal demonstrations for efficient exploration while avoiding the conservative policy caused by imperfect demonstrations. Empirical results show that HYPO significantly outperforms several baselines in various challenging tasks, such as MuJoCo with sparse rewards, Google Research Football, and the AirSim drone simulation.

AAMAS Conference 2023 Conference Paper

Learning Graph-Enhanced Commander-Executor for Multi-Agent Navigation

  • Xinyi Yang
  • Shiyu Huang
  • Yiwen Sun
  • Yuxiang Yang
  • Chao Yu
  • Wei-Wei Tu
  • Huazhong Yang
  • Yu Wang

This paper investigates the multi-agent navigation problem, which requires multiple agents to reach the target goals in a limited time. Multi-agent reinforcement learning (MARL) has shown promising results for solving this issue. However, it is inefficient for MARL to directly explore the (nearly) optimal policy in the large search space, which is exacerbated as the agent number increases (e.g., 10+ agents) or the environment is more complex (e.g., a 3D simulator). Goal-conditioned hierarchical reinforcement learning (HRL) provides a promising direction to tackle this challenge by introducing a hierarchical structure to decompose the search space, where the low-level policy predicts primitive actions in the guidance of the goals derived from the high-level policy. In this paper, we propose Multi-Agent Graph-Enhanced Commander-EXecutor (MAGE-X), a graph-based goal-conditioned hierarchical method for multi-agent navigation tasks. MAGE-X comprises a high-level Goal Commander and a low-level Action Executor. The Goal Commander predicts the probability distribution of the goals and leverages them to assign the most appropriate final target to each agent. The Action Executor utilizes graph neural networks (GNN) to construct a subgraph for each agent that only contains its crucial partners to improve cooperation. Additionally, the Goal Encoder in the Action Executor captures the relationship between the agent and the designated goal to encourage the agent to reach the final target. The results show that MAGE-X outperforms the state-of-the-art MARL baselines with a 100% success rate with only 3 million training steps in multi-agent particle environments (MPE) with 50 agents, and at least a 12% higher success rate and 2× higher data efficiency in a more complicated quadrotor 3D navigation task.

AAAI Conference 2023 Conference Paper

Models as Agents: Optimizing Multi-Step Predictions of Interactive Local Models in Model-Based Multi-Agent Reinforcement Learning

  • Zifan Wu
  • Chao Yu
  • Chen Chen
  • Jianye Hao
  • Hankz Hankui Zhuo

Research in model-based reinforcement learning has made significant progress in recent years. Compared to single-agent settings, the exponential dimension growth of the joint state-action space in multi-agent systems dramatically increases the complexity of the environment dynamics, which makes it infeasible to learn an accurate global model and thus necessitates the use of agent-wise local models. However, during multi-step model rollouts, the prediction of one local model can affect the predictions of other local models in the next step. As a result, local prediction errors can be propagated to other localities and eventually give rise to considerably large global errors. Furthermore, since the models are generally used to predict for multiple steps, simply minimizing one-step prediction errors regardless of their long-term effect on other models may further aggravate the propagation of local errors. To this end, we propose Models as AGents (MAG), a multi-agent model optimization framework that reversely treats the local models as multi-step decision making agents and the current policies as the dynamics during the model rollout process. In this way, the local models are able to consider the multi-step mutual affect between each other before making predictions. Theoretically, we show that the objective of MAG is approximately equivalent to maximizing a lower bound of the true environment return. Experiments on the challenging StarCraft II benchmark demonstrate the effectiveness of MAG.

AAAI Conference 2023 Conference Paper

Subspace-Aware Exploration for Sparse-Reward Multi-Agent Tasks

  • Pei Xu
  • Junge Zhang
  • Qiyue Yin
  • Chao Yu
  • Yaodong Yang
  • Kaiqi Huang

Exploration under sparse rewards is a key challenge for multi-agent reinforcement learning problems. One possible solution to this issue is to exploit inherent task structures to accelerate exploration. In this paper, we present a novel exploration approach for sparse-reward multi-agent tasks, which encodes a special structural prior on the reward function into exploration. Specifically, a novel entropic exploration objective which encodes the structural prior is proposed to accelerate the discovery of rewards. By maximizing the lower bound of this objective, we then propose an algorithm with moderate computational cost, which can be applied to practical tasks. Under the sparse-reward setting, we show that the proposed algorithm significantly outperforms state-of-the-art algorithms in the multiple-particle environment, Google Research Football, and StarCraft II micromanagement tasks. To the best of our knowledge, on some hard tasks (such as 27m_vs_30m) which have a relatively large number of agents and require non-trivial strategies to defeat enemies, our method is the first to learn winning strategies under the sparse-reward setting.

NeurIPS Conference 2022 Conference Paper

A Unified Diversity Measure for Multiagent Reinforcement Learning

  • Zongkai Liu
  • Chao Yu
  • Yaodong Yang
  • Peng Sun
  • Zifan Wu
  • Yuan Li

Promoting behavioural diversity is of critical importance in multi-agent reinforcement learning, since it helps the agent population maintain robust performance when encountering unfamiliar opponents at test time, or when the game is highly non-transitive in the strategy space (e.g., Rock-Paper-Scissors). While a myriad of diversity metrics have been proposed, there are no widely accepted or unified definitions in the literature, making the consequent diversity-aware learning algorithms difficult to evaluate and the insights elusive. In this work, we propose a novel metric called the Unified Diversity Measure (UDM) that offers a unified view of existing diversity metrics. Based on UDM, we design the UDM-Fictitious Play (UDM-FP) and UDM-Policy Space Response Oracle (UDM-PSRO) algorithms as efficient solvers for normal-form games and open-ended games. In theory, we prove that UDM-based methods can enlarge the gamescape by increasing the response capacity of the strategy pool, and have a convergence guarantee to the two-player Nash equilibrium. We validate our algorithms on games that show strong non-transitivity, and empirical results show that our algorithms achieve better performance than strong PSRO baselines in terms of exploitability and population effectivity.

AAAI Conference 2022 Conference Paper

Creativity of AI: Automatic Symbolic Option Discovery for Facilitating Deep Reinforcement Learning

  • Mu Jin
  • Zhihao Ma
  • Kebing Jin
  • Hankz Hankui Zhuo
  • Chen Chen
  • Chao Yu

Despite achieving great success in real life, Deep Reinforcement Learning (DRL) still suffers from three critical issues: data inefficiency, lack of interpretability, and lack of transferability. Recent research shows that embedding symbolic knowledge into DRL is promising in addressing these challenges. Inspired by this, we introduce a novel deep reinforcement learning framework with symbolic options. This framework features a loop training procedure, which guides the improvement of the policy by planning with action models and symbolic options learned automatically from interactive trajectories. The learned symbolic options alleviate the heavy requirement for expert domain knowledge and provide inherent interpretability of policies. Moreover, transferability and data efficiency can be further improved by planning with the action models. To validate the effectiveness of this framework, we conduct experiments on two domains, Montezuma's Revenge and Office World. The results demonstrate comparable performance with improved data efficiency, interpretability, and transferability.

NeurIPS Conference 2022 Conference Paper

Plan To Predict: Learning an Uncertainty-Foreseeing Model For Model-Based Reinforcement Learning

  • Zifan Wu
  • Chao Yu
  • Chen Chen
  • Jianye Hao
  • Hankz Hankui Zhuo

In Model-based Reinforcement Learning (MBRL), model learning is critical since an inaccurate model can bias policy learning via generating misleading samples. However, learning an accurate model can be difficult since the policy is continually updated and the induced distribution over visited states used for model learning shifts accordingly. Prior methods alleviate this issue by quantifying the uncertainty of model-generated samples. However, these methods only quantify the uncertainty passively after the samples were generated, rather than foreseeing the uncertainty before model trajectories fall into those highly uncertain regions. The resulting low-quality samples can induce unstable learning targets and hinder the optimization of the policy. Moreover, while being learned to minimize one-step prediction errors, the model is generally used to predict for multiple steps, leading to a mismatch between the objectives of model learning and model usage. To this end, we propose Plan To Predict (P2P), an MBRL framework that treats the model rollout process as a sequential decision making problem by reversely considering the model as a decision maker and the current policy as the dynamics. In this way, the model can quickly adapt to the current policy and foresee the multi-step future uncertainty when generating trajectories. Theoretically, we show that the performance of P2P can be guaranteed by approximately optimizing a lower bound of the true environment return. Empirical results demonstrate that P2P achieves state-of-the-art performance on several challenging benchmark tasks.

NeurIPS Conference 2022 Conference Paper

The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games

  • Chao Yu
  • Akash Velu
  • Eugene Vinitsky
  • Jiaxuan Gao
  • Yu Wang
  • Alexandre Bayen
  • Yi Wu

Proximal Policy Optimization (PPO) is a ubiquitous on-policy reinforcement learning algorithm but is significantly less utilized than off-policy learning algorithms in multi-agent settings. This is often due to the belief that PPO is significantly less sample efficient than off-policy methods in multi-agent systems. In this work, we carefully study the performance of PPO in cooperative multi-agent settings. We show that PPO-based multi-agent algorithms achieve surprisingly strong performance in four popular multi-agent testbeds: the particle-world environments, the StarCraft multi-agent challenge, the Hanabi challenge, and Google Research Football, with minimal hyperparameter tuning and without any domain-specific algorithmic modifications or architectures. Importantly, compared to competitive off-policy methods, PPO often achieves competitive or superior results in both final returns and sample efficiency. Finally, through ablation studies, we analyze implementation and hyperparameter factors that are critical to PPO's empirical performance, and give concrete practical suggestions regarding these factors. Our results show that when using these practices, simple PPO-based methods are a strong baseline in cooperative multi-agent reinforcement learning. Source code is released at https://github.com/marlbenchmark/on-policy.
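
The per-agent surrogate at the heart of such PPO-based methods is the standard clipped objective, with advantages computed from a centralized value function in MAPPO; this is a generic sketch, not the released codebase's API:

```python
import torch

def ppo_clip_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped PPO surrogate; advantages come from a centralized critic in MAPPO."""
    ratio = (log_probs - old_log_probs).exp()          # importance ratio per sample
    unclipped = ratio * advantages
    clipped = ratio.clamp(1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()       # pessimistic clipped objective
```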

AAAI Conference 2021 Conference Paper

A Joint Training Dual-MRC Framework for Aspect Based Sentiment Analysis

  • Yue Mao
  • Yi Shen
  • Chao Yu
  • Longjun Cai

Aspect based sentiment analysis (ABSA) involves three fundamental subtasks: aspect term extraction, opinion term extraction, and aspect-level sentiment classification. Early works only focused on solving one of these subtasks individually. Some recent work focused on solving a combination of two subtasks, e.g., extracting aspect terms along with sentiment polarities or extracting the aspect and opinion terms pairwise. More recently, the triple extraction task has been proposed, i.e., extracting (aspect term, opinion term, sentiment polarity) triples from a sentence. However, previous approaches fail to solve all subtasks in a unified end-to-end framework. In this paper, we propose a complete solution for ABSA. We construct two machine reading comprehension (MRC) problems and solve all subtasks by jointly training two BERT-MRC models with parameter sharing. We conduct experiments on these subtasks, and results on several benchmark datasets demonstrate the effectiveness of our proposed framework, which significantly outperforms existing state-of-the-art methods.

NeurIPS Conference 2021 Conference Paper

Coordinated Proximal Policy Optimization

  • Zifan Wu
  • Chao Yu
  • Deheng Ye
  • Junge Zhang
  • Haiyin Piao
  • Hankz Hankui Zhuo

We present Coordinated Proximal Policy Optimization (CoPPO), an algorithm that extends the original Proximal Policy Optimization (PPO) to the multi-agent setting. The key idea lies in the coordinated adaptation of step size during the policy update process among multiple agents. We prove the monotonicity of policy improvement when optimizing a theoretically-grounded joint objective, and derive a simplified optimization objective based on a set of approximations. We then interpret that such an objective in CoPPO can achieve dynamic credit assignment among agents, thereby alleviating the high variance issue during the concurrent update of agent policies. Finally, we demonstrate that CoPPO outperforms several strong baselines and is competitive with the latest multi-agent PPO method (i.e., MAPPO) under typical multi-agent settings, including cooperative matrix games and the StarCraft II micromanagement tasks.

AAMAS Conference 2019 Conference Paper

Coordinated Multiagent Reinforcement Learning for Teams of Mobile Sensing Robots

  • Chao Yu
  • Xin Wang
  • Zhanbo Feng

A mobile sensing robot team (MSRT) is a typical application of multi-agent systems. This paper investigates multiagent reinforcement learning in the MSRT problem. A naive coordinated learning approach is first proposed that uses a coordination graph to model interaction relationships among robots. To further reduce the computational complexity in the context of a continuously changing topology caused by robots' movement, we then propose an online transfer learning method that is capable of transferring past interaction experience and learned knowledge to a new context in a dynamic environment. Simulations verify that the method can achieve reasonable team performance by properly balancing robots' local selfish interests and global team performance.

IJCAI Conference 2019 Conference Paper

Large-Scale Home Energy Management Using Entropy-Based Collective Multiagent Deep Reinforcement Learning Framework

  • Yaodong Yang
  • Jianye Hao
  • Yan Zheng
  • Chao Yu

Smart grids contribute to demand-side management by integrating electronic equipment, distributed energy generation and storage, and advanced meters and controllers. With the increasing adoption of electric vehicles and distributed energy generation and storage systems, residential energy management is drawing more and more attention and is regarded as critical to demand-supply balancing and peak load reduction. In this paper, we focus on a microgrid scenario in which modern homes interact together under a large-scale setting to better optimize their electricity cost. We first make households form a group with an economic stimulus. Then we formulate the energy expense optimization problem of the household community as a multi-agent coordination problem and present an Entropy-Based Collective Multiagent Deep Reinforcement Learning (EB-C-MADRL) framework to address it. Experiments with various real-world data demonstrate that EB-C-MADRL can effectively reduce both the long-term group power consumption cost and the daily peak demand compared with existing approaches.

AAMAS Conference 2019 Conference Paper

Reinforcement Learning for Cooperative Overtaking

  • Chao Yu
  • Xin Wang
  • Jianye Hao
  • Zhanbo Feng

This paper solves the cooperative overtaking problem in autonomous driving using reinforcement learning techniques. Learning in such a situation is challenging due to vehicular mobility, which renders a continuously changing environment for each learning vehicle. Without explicit coordination mechanisms, inefficient behaviors among vehicles might cause fatal uncoordinated outcomes. To solve this issue, we propose two basic coordination models to enable distributed learning of cooperative overtaking maneuvers in a group of vehicles. Extension mechanisms are then presented to make these models workable in more complex and realistic settings with any number of vehicles. Experiments verify that, by capturing the underlying consistency of identities or positions during vehicles' movement, efficient coordinated behaviors can be achieved simply through vehicles' local learning interactions.

IJCAI Conference 2019 Conference Paper

The Price of Governance: A Middle Ground Solution to Coordination in Organizational Control

  • Chao Yu
  • Guozhen Tan

Achieving coordination is crucial in organizational control. This paper investigates a middle ground solution between decentralized interactions and centralized administrations for steering agents away from inefficient behavior. We first propose the price of governance (PoG) to evaluate how such a middle ground solution performs in terms of effectiveness and cost. We then propose a hierarchical supervision framework to explicitly model the PoG, and define step by step how to realize the core principle of the framework and compute the optimal PoG for a control problem. Two illustrative case studies are carried out to exemplify the applications of the proposed framework and its methodology. Results show that the hierarchical supervision framework is capable of promoting coordination among agents while keeping administrative cost to a minimum in different kinds of organizational control problems.

TAAS Journal 2017 Journal Article

Efficient and Robust Emergence of Norms through Heuristic Collective Learning

  • Jianye Hao
  • Jun Sun
  • Guangyong Chen
  • Zan Wang
  • Chao Yu
  • Zhong Ming

In multiagent systems, social norms serve as an important technique for regulating agents' behaviors to ensure effective coordination among agents without a centralized controlling mechanism. In such a distributed environment, it is important to investigate how a desirable social norm can be synthesized in a bottom-up manner among agents through repeated local interactions and learning techniques. In this article, we propose two novel learning strategies under the collective learning framework, collective learning EV-l and collective learning EV-g, to efficiently facilitate the emergence of social norms. Extensive simulation results show that both learning strategies can support the emergence of desirable social norms more efficiently and be applicable in a wider range of multiagent interaction scenarios compared with previous work. The influence of different topologies is investigated, which shows that the performance of all strategies is robust across different network topologies. The influences of a number of key factors (neighborhood size, action space, population size, fixed agents, and isolated subpopulations) on norm emergence performance are investigated as well.

AAMAS Conference 2016 Conference Paper

An Adaptive Learning Framework for Efficient Emergence of Social Norms (Extended Abstract)

  • Chao Yu
  • Hongtao Lv
  • Sandip Sen
  • Jianye Hao
  • Fenghui Ren
  • Rui Liu

This paper investigates how norm emergence can be facilitated by agents' adaptive learning behaviors. A general learning framework is proposed, in which agents can dynamically adapt their learning behaviors through social learning from their individual learning experience. Experimental results indicate that the proposed framework outperforms the static learning framework across various comparison criteria.