Arrow Research

Author name cluster

Deheng Ye

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

30 papers
2 author rows

Possible papers (30)

AAAI Conference 2026 Conference Paper

Multi-agent In-context Coordination via Decentralized Memory Retrieval

  • Tao Jiang
  • Zichuan Lin
  • Lihe Li
  • Yi-Chen Li
  • Cong Guan
  • Lei Yuan
  • Zongzhang Zhang
  • Yang Yu

Large transformer models, trained on diverse datasets, have demonstrated impressive few-shot performance on previously unseen tasks without requiring parameter updates. This capability has also been explored in Reinforcement Learning (RL), where agents interact with the environment to retrieve context and maximize cumulative rewards, showcasing strong adaptability in complex settings. However, in cooperative Multi-Agent Reinforcement Learning (MARL), where agents must coordinate toward a shared goal, decentralized policy deployment can lead to mismatches in task alignment and reward assignment, limiting the efficiency of policy adaptation. To address this challenge, we introduce Multi-agent In-context Coordination via Decentralized Memory Retrieval (MAICC), a novel approach designed to enhance coordination through fast adaptation. Our method involves training a centralized embedding model to capture fine-grained trajectory representations, followed by decentralized models that approximate the centralized one to obtain team-level task information. Based on the learned embeddings, relevant trajectories are retrieved as context, which, combined with the agents' current sub-trajectories, informs decision-making. During decentralized execution, we introduce a novel memory mechanism that effectively balances test-time online data with offline memory. Based on the constructed memory, we propose a hybrid utility score that incorporates both individual- and team-level returns, ensuring credit assignment across agents. Extensive experiments on cooperative MARL benchmarks, including Level-Based Foraging (LBF) and SMAC (v1/v2), show that MAICC enables faster adaptation to unseen tasks compared to existing methods.
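
To make the retrieval-and-scoring idea above concrete, here is a minimal sketch, assuming precomputed trajectory embeddings and per-trajectory individual/team returns; the function name, the additive combination of similarity and utility, and the `alpha` weight are illustrative assumptions, not the MAICC implementation.

```python
# Illustrative sketch (not the authors' code): embedding-similarity retrieval
# combined with a hybrid individual/team utility score.
import numpy as np

def retrieve_context(query_emb, memory_embs, indiv_returns, team_returns,
                     k=5, alpha=0.5):
    """Return indices of the k memory trajectories ranked by a mix of
    embedding similarity and a hybrid (individual + team) return score."""
    # Cosine similarity between the current sub-trajectory and stored memories.
    q = query_emb / (np.linalg.norm(query_emb) + 1e-8)
    m = memory_embs / (np.linalg.norm(memory_embs, axis=1, keepdims=True) + 1e-8)
    sim = m @ q
    # Hybrid utility: alpha trades off individual return against team return.
    utility = alpha * indiv_returns + (1.0 - alpha) * team_returns
    # Additive combination of similarity and utility is purely illustrative.
    score = sim + utility
    return np.argsort(score)[-k:][::-1]

# Example usage with random placeholder data.
rng = np.random.default_rng(0)
idx = retrieve_context(rng.normal(size=16), rng.normal(size=(100, 16)),
                       rng.uniform(size=100), rng.uniform(size=100))
print(idx)
```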

ICML Conference 2025 Conference Paper

Ad Hoc Teamwork via Offline Goal-Based Decision Transformers

  • Xinzhi Zhang 0009
  • Hohei Chan
  • Deheng Ye
  • Yi Cai 0001
  • Mengchen Zhao

The ability of agents to collaborate with previously unknown teammates on the fly, known as ad hoc teamwork (AHT), is crucial in many real-world applications. Existing approaches to AHT require online interactions with the environment and some carefully designed teammates. However, these prerequisites can be infeasible in practice. In this work, we extend the AHT problem to the offline setting, where the policy of the ego agent is directly learned from a multi-agent interaction dataset. We propose a hierarchical sequence modeling framework called TAGET that addresses critical challenges in the offline setting, including limited data, partial observability and online adaptation. The core idea of TAGET is to dynamically predict teammate-aware rewards-to-go and sub-goals, so that the ego agent can adapt to the changes of teammates’ behaviors in real time. Extensive experimental results show that TAGET significantly outperforms existing solutions to AHT in the offline setting.

ICLR Conference 2025 Conference Paper

Cross-Domain Offline Policy Adaptation with Optimal Transport and Dataset Constraint

  • Jiafei Lyu
  • Mengbei Yan
  • Zhongjian Qiao
  • Runze Liu 0002
  • Xiaoteng Ma
  • Deheng Ye
  • Jingwen Yang
  • Zongqing Lu 0002

We explore cross-domain offline reinforcement learning (RL), where offline datasets from another domain can be accessed to facilitate policy learning. However, the underlying environments of the two datasets may have dynamics mismatches, incurring inferior performance when simply merging the data of the two domains. Existing methods mitigate this issue by training domain classifiers, using contrastive learning methods, etc. Nevertheless, they still rely on a large amount of target domain data to function well. Instead, we address this problem by establishing a concrete performance bound of a policy given datasets from two domains. Motivated by the theoretical insights, we propose to align transitions in the two datasets using optimal transport and to selectively share source domain samples, without training any neural networks. This enables reliable data filtering even given only a small amount of target domain data. Additionally, we introduce a dataset regularization term that ensures the learned policy remains within the scope of the target domain dataset, preventing it from being biased towards the source domain data. Consequently, we propose the Optimal Transport Data Filtering (OTDF) method and examine its effectiveness by conducting extensive experiments across various dynamics shift conditions (e.g., gravity shift), given limited target domain data. OTDF exhibits superior performance across many tasks and dataset qualities, often surpassing prior strong baselines by a large margin.
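
As a rough illustration of selectively sharing source-domain transitions based on how well they align with the target domain, the sketch below uses a simple nearest-neighbour cost as a stand-in for the optimal-transport alignment in OTDF; the flattened-transition representation and the `keep_ratio` threshold are assumptions made purely for illustration.

```python
# Minimal sketch of selectively sharing source-domain transitions by their
# alignment cost to target-domain transitions. This is a crude nearest-
# neighbour stand-in, not the optimal-transport formulation in OTDF.
import numpy as np
from scipy.spatial.distance import cdist

def filter_source_transitions(source, target, keep_ratio=0.3):
    """source, target: arrays of shape (N, d) / (M, d), each row a flattened
    (s, a, s') transition. Keep the source rows closest to the target set."""
    cost = cdist(source, target)          # pairwise transition distances
    per_source_cost = cost.min(axis=1)    # cheapest match for each source row
    n_keep = max(1, int(keep_ratio * len(source)))
    keep_idx = np.argsort(per_source_cost)[:n_keep]
    return source[keep_idx]

src = np.random.randn(500, 12)
tgt = np.random.randn(40, 12)             # only a few target-domain transitions
shared = filter_source_transitions(src, tgt)
print(shared.shape)
```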

ICML Conference 2025 Conference Paper

Decision Mixer: Integrating Long-term and Local Dependencies via Dynamic Token Selection for Decision-Making

  • Hongling Zheng
  • Li Shen 0008
  • Yong Luo 0002
  • Deheng Ye
  • Bo Du 0001
  • Jialie Shen 0001
  • Dacheng Tao

The Conditional Sequence Modeling (CSM) paradigm, benefiting from the transformer's powerful distribution modeling capabilities, has demonstrated considerable promise in offline Reinforcement Learning (RL) tasks. Depending on the task's nature, it is crucial to carefully balance the interplay between inherent local features and long-term dependencies in Markov decision trajectories to mitigate potential performance degradation and unnecessary computational overhead. In this paper, we propose Decision Mixer (DM), which addresses the conflict between features of different scales in the modeling process from the perspective of dynamic integration. Drawing inspiration from conditional computation, we design a plug-and-play dynamic token selection mechanism to ensure the model can effectively allocate attention to different features based on task characteristics. Additionally, we employ an auxiliary predictor to alleviate the short-sightedness issue in the autoregressive sampling process. DM achieves state-of-the-art performance on various standard RL benchmarks while requiring significantly fewer computational resources, offering a viable solution for building efficient and scalable RL foundation models. Code is available.

NeurIPS Conference 2025 Conference Paper

OpenMMEgo: Enhancing Egocentric Understanding for LMMs with Open Weights and Data

  • Hao Luo
  • Zihao Yue
  • Wanpeng Zhang
  • Yicheng Feng
  • Sipeng Zheng
  • Deheng Ye
  • Zongqing Lu

Recent advances in large multimodal models have substantially improved video comprehension, yet their performance remains limited in first-person scenarios. The interactive nature of egocentric videos is critical for applications like embodied intelligence, but it introduces complex visual contexts that conventional models struggle to capture. To bridge this gap, we introduce OpenMMEgo with innovations across three dimensions: data, model, and training strategy. To provide rich spatiotemporal visual knowledge, we curate a large-scale, high-quality dataset named OME10M, comprising over 8.2M egocentric video QA pairs synthesized from the Ego4D series. We also establish OMEBench, a comprehensive benchmark for rigorous egocentric understanding assessment. To alleviate the frequent viewpoint shifts inherent in egocentric videos, we implement semantic-aware visual token compression. Further, a curriculum learning strategy is adopted to foster stable learning across varying data complexities. OpenMMEgo consistently improves the performance of LMMs on egocentric benchmarks without sacrificing general video understanding performance. Notably, Qwen2.5-VL tuned with OpenMMEgo substantially outperforms other models of the same size in egocentric video understanding. The data, weights, and training code will be released at https://github.com/BeingBeyond/OpenMMEgo.

NeurIPS Conference 2025 Conference Paper

Robust Policy Expansion for Offline-to-Online RL under Diverse Data Corruption

  • Longxiang He
  • Deheng Ye
  • Junbo Tan
  • Xueqian Wang
  • Li Shen

Pretraining a policy on offline data followed by fine-tuning through online interactions, known as Offline-to-Online Reinforcement Learning (O2O RL), has emerged as a promising paradigm for real-world RL deployment. However, both offline datasets and online interactions in practical environments are often noisy or even maliciously corrupted, severely degrading the performance of O2O RL. Existing works primarily focus on mitigating the conservatism of offline policies via online exploration, while the robustness of O2O RL under data corruption, including corrupted states, actions, rewards, and dynamics, remains unexplored. In this work, we observe that data corruption induces heavy-tailed behavior in the policy, thereby substantially degrading the efficiency of online exploration. To address this issue, we incorporate Inverse Probability Weighting (IPW) into the online exploration policy to alleviate heavy-tailedness, and propose a novel, simple yet effective method termed RPEX: Robust Policy EXpansion. Extensive experimental results on D4RL datasets demonstrate that RPEX achieves SOTA O2O performance across a wide range of data corruption scenarios.

NeurIPS Conference 2025 Conference Paper

Value-Guided Decision Transformer: A Unified Reinforcement Learning Framework for Online and Offline Settings

  • Hongling Zheng
  • Li Shen
  • Yong Luo
  • Deheng Ye
  • Shuhan Xu
  • Bo Du
  • Jialie Shen
  • Dacheng Tao

The Conditional Sequence Modeling (CSM) paradigm, benefiting from the transformer's powerful distribution modeling capabilities, has demonstrated considerable promise in Reinforcement Learning (RL) tasks. However, much of this work has focused on applying CSM to a single setting, either online or offline, and a general architecture spanning both has rarely been explored. Additionally, existing methods primarily focus on deterministic trajectory modeling, overlooking the randomness of state transitions and the diversity of future trajectory distributions. Fortunately, value-based methods offer a viable solution for CSM, further bridging the potential gap between offline and online RL. In this paper, we propose the Value-Guided Decision Transformer (VDT), which leverages value functions to perform advantage-weighting and behavior regularization on the Decision Transformer (DT), guiding the policy toward upper-bound optimal decisions during the offline training phase. In the online tuning phase, VDT further integrates value-based policy improvement with behavior cloning under the CSM architecture through limited interaction and data collection, achieving performance improvement within minimal timesteps. The predictive capability of value functions for future returns is also incorporated into the sampling process. Our method achieves competitive performance on various standard RL benchmarks, providing a feasible solution for developing CSM architectures in general scenarios. Code is available.

NeurIPS Conference 2025 Conference Paper

WALL-E: World Alignment by NeuroSymbolic Learning improves World Model-based LLM Agents

  • Siyu Zhou
  • Tianyi Zhou
  • Yijun Yang
  • Guodong Long
  • Deheng Ye
  • Jing Jiang
  • Chengqi Zhang

Can we build accurate world models out of large language models (LLMs)? How can world models benefit LLM agents? The gap between the prior knowledge of LLMs and the specified environment's dynamics usually bottlenecks LLMs' performance as world models. To bridge the gap, we propose a training-free "world alignment" that learns an environment's symbolic knowledge complementary to LLMs. The symbolic knowledge covers action rules, knowledge graphs, and scene graphs, which are extracted by LLMs from exploration trajectories and encoded into executable code to regulate LLM agents' policies. We further propose an RL-free, model-based agent "WALL-E" built on the model-predictive control (MPC) framework. Unlike classical MPC, which requires costly optimization on the fly, we adopt an LLM agent as an efficient look-ahead optimizer of future steps' actions by interacting with the neurosymbolic world model. While the LLM agent's strong heuristics make it an efficient planner in MPC, the quality of its planned actions is also secured by the accurate predictions of the aligned world model. Together they considerably improve learning efficiency in a new environment. On open-world challenges in Mars (a Minecraft-like environment) and ALFWorld (embodied indoor environments), WALL-E significantly outperforms existing methods, e.g., surpassing baselines in Mars by 16.1%–51.6% in success rate and by at least 61.7% in score. In ALFWorld, it achieves a new record of 98% success rate after only 4 iterations.

TMLR Journal 2024 Journal Article

Affordable Generative Agents

  • Yangbin Yu
  • Qin Zhang
  • Junyou Li
  • Qiang Fu
  • Deheng Ye

The emergence of large language models (LLMs) has significantly advanced the simulation of believable interactive agents. However, the substantial cost of maintaining prolonged agent interactions poses a challenge to the deployment of believable LLM-based agents. Therefore, in this paper, we develop Affordable Generative Agents (AGA), a framework for generating believable, low-cost interactions both between agents and their environment and among agents. Specifically, for agent-environment interactions, we substitute repetitive LLM inferences with learned policies; for inter-agent interactions, we model the social relationships between agents and compress auxiliary dialogue information. Extensive experiments on multiple environments show the effectiveness and efficiency of our proposed framework. We also examine the mechanisms behind the emergent believable behaviors of LLM agents, demonstrating that agents can only generate finite behaviors in fixed environments, and based on this we identify ways to facilitate emergent interaction behaviors. Our code is publicly available at: https://github.com/AffordableGenerativeAgents/Affordable-Generative-Agents.

NeurIPS Conference 2024 Conference Paper

Learning Versatile Skills with Curriculum Masking

  • Yao Tang
  • Zhihui Xie
  • Zichuan Lin
  • Deheng Ye
  • Shuai Li

Masked prediction has emerged as a promising pretraining paradigm in offline reinforcement learning (RL) due to its versatile masking schemes, enabling flexible inference across various downstream tasks with a unified model. Despite the versatility of masked prediction, it remains unclear how to balance the learning of skills at different levels of complexity. To address this, we propose CurrMask, a curriculum masking pretraining paradigm for sequential decision making. Motivated by how humans learn by organizing knowledge in a curriculum, CurrMask adjusts its masking scheme during pretraining for learning versatile skills. Through extensive experiments, we show that CurrMask exhibits superior zero-shot performance on skill prompting tasks, goal-conditioned planning tasks, and competitive finetuning performance on offline RL tasks. Additionally, our analysis of training dynamics reveals that CurrMask gradually acquires skills of varying complexity by dynamically adjusting its masking scheme.
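
The notion of a masking curriculum can be illustrated with a small sketch in which the mask ratio and block length grow with training progress; the particular schedule below is an assumption made for illustration, since CurrMask adjusts its masking scheme dynamically rather than following a fixed formula.

```python
# Illustrative sketch of a curriculum over block-wise masking: the mask ratio
# and block size grow with training progress (easy single-token masks early,
# long-block masks later). The schedule is an illustrative assumption.
import numpy as np

def curriculum_mask(seq_len, progress, rng):
    """progress in [0, 1]; returns a boolean mask of tokens to hide."""
    mask_ratio = 0.15 + 0.45 * progress          # easy -> hard
    block_size = int(1 + 7 * progress)           # single tokens -> long blocks
    mask = np.zeros(seq_len, dtype=bool)
    n_to_mask = int(mask_ratio * seq_len)
    while mask.sum() < n_to_mask:
        start = rng.integers(0, seq_len)
        mask[start:start + block_size] = True    # mask a contiguous block
    return mask

rng = np.random.default_rng(0)
print(curriculum_mask(seq_len=64, progress=0.8, rng=rng))
```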

TMLR Journal 2024 Journal Article

More Agents Is All You Need

  • Junyou Li
  • Qin Zhang
  • Yangbin Yu
  • Qiang Fu
  • Deheng Ye

We find that, simply via a sampling-and-voting method, the performance of large language models (LLMs) scales with the number of agents instantiated. This method, termed Agent Forest, is orthogonal to existing, more elaborate methods for enhancing LLMs, and the degree of enhancement correlates with task difficulty. We conduct comprehensive experiments on a wide range of LLM benchmarks to verify our finding and to study the properties that can facilitate its occurrence. Our code is publicly available at: https://github.com/MoreAgentsIsAllYouNeed/AgentForest
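
The core sampling-and-voting loop is simple enough to sketch directly; `generate` stands in for any LLM call, and the string normalisation, tie-breaking, and toy stand-in below are illustrative assumptions rather than the Agent Forest codebase.

```python
# Minimal sketch of sampling-and-voting: draw several answers from the same
# model and return the most frequent one.
import random
from collections import Counter

def sample_and_vote(generate, prompt, n_agents=10):
    """generate(prompt) -> str. Majority vote over n_agents sampled answers."""
    answers = [generate(prompt).strip().lower() for _ in range(n_agents)]
    return Counter(answers).most_common(1)[0][0]

# Toy stand-in for an LLM call, just to make the sketch runnable.
def toy_generate(prompt):
    return random.choice(["42", "42", "41"])

print(sample_and_vote(toy_generate, "What is 6 * 7?"))
```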

TMLR Journal 2024 Journal Article

Revisiting Discrete Soft Actor-Critic

  • Haibin Zhou
  • Tong Wei
  • Zichuan Lin
  • Junyou Li
  • Junliang Xing
  • Yuanchun Shi
  • Li Shen
  • Chao Yu

We study the adaptation of Soft Actor-Critic (SAC), a state-of-the-art reinforcement learning (RL) algorithm, from continuous action spaces to discrete action spaces. We revisit vanilla discrete SAC and provide an in-depth understanding of its Q-value underestimation and performance instability issues when applied to discrete settings. We thereby propose Stable Discrete SAC (SDSAC), an algorithm that leverages an entropy penalty and double average Q-learning with Q-clip to address these issues. Extensive experiments on typical benchmarks with discrete action spaces, including Atari games and a large-scale MOBA game, show the efficacy of our proposed method. Our code is at: https://github.com/coldsummerday/SD-SAC.git.

TMLR Journal 2023 Journal Article

A Survey on Transformers in Reinforcement Learning

  • Wenzhe Li
  • Hao Luo
  • Zichuan Lin
  • Chongjie Zhang
  • Zongqing Lu
  • Deheng Ye

The Transformer has become the dominant neural architecture in NLP and CV, mostly under supervised settings. Recently, a similar surge of Transformer usage has appeared in the domain of reinforcement learning (RL), but it faces unique design choices and challenges brought by the nature of RL. However, the evolution of Transformers in RL has not yet been well unraveled. In this paper, we systematically review the motivations and progress of using Transformers in RL, provide a taxonomy of existing works, discuss each sub-field, and summarize future prospects.

ICML Conference 2023 Conference Paper

Future-conditioned Unsupervised Pretraining for Decision Transformer

  • Zhihui Xie 0002
  • Zichuan Lin
  • Deheng Ye
  • Qiang Fu 0016
  • Yang Wei
  • Shuai Li 0010

Recent research in offline reinforcement learning (RL) has demonstrated that return-conditioned supervised learning is a powerful paradigm for decision-making problems. While promising, return conditioning is limited to training data labeled with rewards and therefore faces challenges in learning from unsupervised data. In this work, we aim to utilize generalized future conditioning to enable efficient unsupervised pretraining from reward-free and sub-optimal offline data. We propose the Pretrained Decision Transformer (PDT), a conceptually simple approach for unsupervised RL pretraining. PDT leverages future trajectory information as a privileged context to predict actions during training. The ability to make decisions based on both present and future factors enhances PDT's capability for generalization. Moreover, this feature can be easily incorporated into a return-conditioned framework for online finetuning by assigning return values to possible futures and sampling future embeddings based on their respective values. Empirically, PDT outperforms or performs on par with its supervised pretraining counterpart, especially when dealing with sub-optimal data. Further analysis reveals that PDT can extract diverse behaviors from offline data and controllably sample high-return behaviors via online finetuning. Code is available.
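
The finetuning-time mechanism of sampling future embeddings in proportion to their assigned return values can be sketched as a softmax draw; the temperature and the candidate set below are illustrative assumptions, not PDT's exact procedure.

```python
# Sketch of value-weighted sampling over candidate future embeddings.
import numpy as np

def sample_future(future_embs, future_values, temperature=1.0, rng=None):
    """Draw one future embedding with probability proportional to
    exp(value / temperature)."""
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(future_values) / temperature
    probs = np.exp(logits - logits.max())   # numerically stable softmax
    probs /= probs.sum()
    idx = rng.choice(len(future_embs), p=probs)
    return future_embs[idx]

rng = np.random.default_rng(0)
emb = sample_future(np.random.randn(8, 32), np.random.randn(8), rng=rng)
print(emb.shape)
```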

NeurIPS Conference 2023 Conference Paper

Hokoff: Real Game Dataset from Honor of Kings and its Offline Reinforcement Learning Benchmarks

  • Yun Qu
  • Boyuan Wang
  • Jianzhun Shao
  • Yuhang Jiang
  • Chen Chen
  • Zhenbin Ye
  • Liu Linc
  • Yang Feng

The advancement of Offline Reinforcement Learning (RL) and Offline Multi-Agent Reinforcement Learning (MARL) critically depends on the availability of high-quality, pre-collected offline datasets that represent real-world complexities and practical applications. However, existing datasets often fall short due to their simplicity and lack of realism. To address this gap, we propose Hokoff, a comprehensive set of pre-collected datasets that covers both offline RL and offline MARL, accompanied by a robust framework to facilitate further research. The data is derived from Honor of Kings, a recognized Multiplayer Online Battle Arena (MOBA) game known for its intricate nature, closely resembling real-life situations. Utilizing this framework, we benchmark a variety of offline RL and offline MARL algorithms. We also introduce a novel baseline algorithm tailored for the inherent hierarchical action space of the game. Finally, we reveal the shortcomings of current offline RL approaches in handling task complexity, generalization, and multi-task learning.

ICLR Conference 2023 Conference Paper

More Centralized Training, Still Decentralized Execution: Multi-Agent Conditional Policy Factorization

  • Jiangxing Wang
  • Deheng Ye
  • Zongqing Lu 0002

In cooperative multi-agent reinforcement learning (MARL), combining value decomposition with actor-critic enables agents to learn stochastic policies, which are more suitable for the partially observable environment. Given the goal of learning local policies that enable decentralized execution, agents are commonly assumed to be independent of each other, even in centralized training. However, such an assumption may prohibit agents from learning the optimal joint policy. To address this problem, we explicitly take the dependency among agents into centralized training. Although this leads to the optimal joint policy, it may not be factorized for decentralized execution. Nevertheless, we theoretically show that from such a joint policy, we can always derive another joint policy that achieves the same optimality but can be factorized for decentralized execution. To this end, we propose multi-agent conditional policy factorization (MACPF), which takes more centralized training but still enables decentralized execution. We empirically verify MACPF in various cooperative MARL tasks and demonstrate that MACPF achieves better performance or faster convergence than baselines. Our code is available at https://github.com/PKU-RL/FOP-DMAC-MACPF.

NeurIPS Conference 2023 Conference Paper

Mutual-Information Regularized Multi-Agent Policy Iteration

  • Wang Wang
  • Deheng Ye
  • Zongqing Lu

Despite the success of cooperative multi-agent reinforcement learning algorithms, most of them focus on a single team composition, which prevents them from being used in more realistic scenarios where dynamic team composition is possible. While some studies attempt to solve this problem via multi-task learning in a fixed set of team compositions, there is still a risk of overfitting to the training set, which may lead to catastrophic performance when facing dramatically varying team compositions during execution. To address this problem, we propose to use mutual information (MI) as an augmented reward to prevent individual policies from relying too much on team-related information and encourage agents to learn policies that are robust in different team compositions. Optimizing this MI-augmented objective in an off-policy manner can be intractable due to the existence of dynamic marginal distribution. To alleviate this problem, we first propose a multi-agent policy iteration algorithm with a fixed marginal distribution and prove its convergence and optimality. Then, we propose to employ the Blahut–Arimoto algorithm and an imaginary team composition distribution for optimization with approximate marginal distribution as the practical implementation. Empirically, our method demonstrates strong zero-shot generalization to dynamic team compositions in complex cooperative tasks.

TMLR Journal 2023 Journal Article

Replay-enhanced Continual Reinforcement Learning

  • Tiantian Zhang
  • Kevin Zehua Shen
  • Zichuan Lin
  • Bo Yuan
  • Xueqian Wang
  • Xiu Li
  • Deheng Ye

Replaying past experiences has proven to be a highly effective approach for averting catastrophic forgetting in supervised continual learning. However, some crucial factors are still largely ignored, making it vulnerable to serious failure, when used as a solution to forgetting in continual reinforcement learning, even in the context of perfect memory where all data of previous tasks are accessible in the current task. On the one hand, since most reinforcement learning algorithms are not invariant to the reward scale, the previously well-learned tasks (with high rewards) may appear to be more salient to the current learning process than the current task (with small initial rewards). This causes the agent to concentrate on those salient tasks at the expense of generality on the current task. On the other hand, offline learning on replayed tasks while learning a new task may induce a distributional shift between the dataset and the learned policy on old tasks, resulting in forgetting. In this paper, we introduce RECALL, a replay-enhanced method that greatly improves the plasticity of existing replay-based methods on new tasks while effectively avoiding the recurrence of catastrophic forgetting in continual reinforcement learning. RECALL leverages adaptive normalization on approximate targets and policy distillation on old tasks to enhance generality and stability, respectively. Extensive experiments on the Continual World benchmark show that RECALL performs significantly better than purely perfect memory replay, and achieves comparable or better overall performance against state-of-the-art continual learning methods.
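
One ingredient mentioned above, adaptive normalisation of learning targets so that previously high-reward tasks do not dominate the current task, can be sketched with running statistics; the estimator below is a generic choice for illustration, not RECALL's exact scheme.

```python
# Sketch of per-task adaptive target normalisation using running statistics
# (Welford-style parallel update). One normaliser per task keeps high-reward
# old tasks from overwhelming a new task with small initial rewards.
import numpy as np

class RunningNormalizer:
    def __init__(self, eps=1e-8):
        self.mean, self.var, self.count, self.eps = 0.0, 1.0, 0.0, eps

    def update(self, x):
        x = np.asarray(x, dtype=np.float64)
        batch_mean, batch_var, n = x.mean(), x.var(), x.size
        delta = batch_mean - self.mean
        total = self.count + n
        self.mean += delta * n / total
        self.var = (self.var * self.count + batch_var * n +
                    delta ** 2 * self.count * n / total) / total
        self.count = total

    def normalize(self, x):
        return (x - self.mean) / np.sqrt(self.var + self.eps)

norm = RunningNormalizer()
norm.update([10.0, 12.0, 9.0])      # e.g. TD targets from one task
print(norm.normalize(11.0))
```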

AAAI Conference 2023 Conference Paper

RLogist: Fast Observation Strategy on Whole-Slide Images with Deep Reinforcement Learning

  • Boxuan Zhao
  • Jun Zhang
  • Deheng Ye
  • Jian Cao
  • Xiao Han
  • Qiang Fu
  • Wei Yang

Whole-slide images (WSI) in computational pathology have high resolution with gigapixel size, but generally have sparse regions of interest, which leads to weak diagnostic relevance and data inefficiency for each area in the slide. Most of the existing methods rely on a multiple instance learning framework that requires densely sampling local patches at high magnification. This limitation is evident at the application stage, as the heavy computation for extracting patch-level features is inevitable. In this paper, we develop RLogist, a benchmarking deep reinforcement learning (DRL) method for fast observation strategies on WSIs. Imitating the diagnostic logic of human pathologists, our RL agent learns how to find regions of observation value and obtain representative features across multiple resolution levels, without having to analyze each part of the WSI at high magnification. We benchmark our method on two whole-slide level classification tasks, including detection of metastases in WSIs of lymph node sections and subtyping of lung cancer. Experimental results demonstrate that RLogist achieves competitive classification performance compared to typical multiple instance learning algorithms, while having a significantly shorter observation path. In addition, the observation path given by RLogist provides good decision-making interpretability, and its ability to navigate reading paths can potentially be used by pathologists for educational/assistive purposes. Our code is available at: https://github.com/tencent-ailab/RLogist.

TMLR Journal 2023 Journal Article

RLTF: Reinforcement Learning from Unit Test Feedback

  • Jiate Liu
  • Yiqin Zhu
  • Kaiwen Xiao
  • Qiang Fu
  • Xiao Han
  • Yang Wei
  • Deheng Ye

The goal of program synthesis, or code generation, is to generate executable code based on given descriptions. Recently, there has been an increasing number of studies employing reinforcement learning (RL) to improve the performance of large language models (LLMs) for code. However, some of the current representative RL methods use only offline frameworks, limiting the exploration of new sample spaces. Additionally, their utilization of unit test signals is limited, not accounting for specific error locations within the code. To address these issues, we propose RLTF, i.e., Reinforcement Learning from Unit Test Feedback, a novel online RL framework with multi-granularity unit test feedback for refining code LLMs. Our approach generates data in real time during training and simultaneously utilizes fine-grained feedback signals to guide the model towards producing higher-quality code. Extensive experiments show that RLTF achieves state-of-the-art performance on the APPS and MBPP benchmarks. Our code is available at: https://github.com/Zyq-scut/RLTF.
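
The idea of deriving a training reward from unit-test outcomes can be sketched at a coarse granularity (pass, failed assertion, or runtime error); the reward values and the subprocess harness below are illustrative assumptions and omit the finer error-localisation signal that RLTF also exploits.

```python
# Illustrative sketch of turning unit-test feedback into a scalar reward.
import subprocess
import sys

def unit_test_reward(generated_code, test_code, timeout=5):
    """Run generated code plus its tests in a subprocess; map the outcome to a
    coarse reward. The specific reward values are illustrative choices."""
    program = generated_code + "\n" + test_code
    try:
        result = subprocess.run([sys.executable, "-c", program],
                                capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return -1.0                       # non-terminating program
    if result.returncode == 0:
        return 1.0                        # all tests passed
    if "AssertionError" in result.stderr:
        return -0.3                       # ran, but a test failed
    return -0.6                           # syntax or runtime error

code = "def add(a, b):\n    return a + b"
tests = "assert add(2, 3) == 5"
print(unit_test_reward(code, tests))      # -> 1.0
```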

NeurIPS Conference 2022 Conference Paper

Honor of Kings Arena: an Environment for Generalization in Competitive Reinforcement Learning

  • Hua Wei
  • Jingxiao Chen
  • Xiyang Ji
  • Hongyang Qin
  • Minwen Deng
  • Siqin Li
  • Liang Wang
  • Weinan Zhang

This paper introduces Honor of Kings Arena, a reinforcement learning (RL) environment based on Honor of Kings, one of the world's most popular games at present. Compared to other environments studied in most previous work, ours presents new generalization challenges for competitive reinforcement learning. It is a multi-agent problem with one agent competing against its opponent, and it requires generalization ability as it has diverse targets to control and diverse opponents to compete with. We describe the observation, action, and reward specifications for the Honor of Kings domain and provide an open-source Python-based interface for communicating with the game engine. We provide twenty target heroes with a variety of tasks in Honor of Kings Arena and present initial baseline results for RL-based methods with feasible computing resources. Finally, we showcase the generalization challenges imposed by Honor of Kings Arena and possible remedies to these challenges. All of the software, including the environment class, is publicly available.

IJCAI Conference 2022 Conference Paper

JueWu-MC: Playing Minecraft with Sample-efficient Hierarchical Reinforcement Learning

  • Zichuan Lin
  • Junyou Li
  • Jianing Shi
  • Deheng Ye
  • Qiang Fu
  • Wei Yang

Learning rational behaviors in open-world games like Minecraft remains challenging for Reinforcement Learning (RL) research due to the compound challenge of partial observability, high-dimensional visual perception, and delayed reward. To address this, we propose JueWu-MC, a sample-efficient hierarchical RL approach equipped with representation learning and imitation learning to deal with perception and exploration. Specifically, our approach includes two levels of hierarchy, where the high-level controller learns a policy over options and the low-level workers learn to solve each sub-task. To boost the learning of sub-tasks, we propose a combination of techniques including 1) action-aware representation learning, which captures underlying relations between actions and representations, 2) discriminator-based self-imitation learning for efficient exploration, and 3) ensemble behavior cloning with consistency filtering for policy robustness. Extensive experiments show that JueWu-MC significantly improves sample efficiency and outperforms a set of baselines by a large margin. Notably, we won the championship of the NeurIPS MineRL 2021 research competition and achieved the highest performance score ever.

IJCAI Conference 2021 Conference Paper

Boosting Offline Reinforcement Learning with Residual Generative Modeling

  • Hua Wei
  • Deheng Ye
  • Zhao Liu
  • Hao Wu
  • Bo Yuan
  • Qiang Fu
  • Wei Yang
  • Zhenhui Li

Offline reinforcement learning (RL) tries to learn the near-optimal policy with recorded offline experience without online exploration. Current offline RL research includes: 1) generative modeling, i.e., approximating a policy using fixed data; and 2) learning the state-action value function. While most research focuses on the state-action function part through reducing the bootstrapping error in value function approximation induced by the distribution shift of training data, the effects of error propagation in generative modeling have been neglected. In this paper, we analyze the error in generative modeling. We propose AQL (action-conditioned Q-learning), a residual generative model to reduce policy approximation error for offline RL. We show that our method can learn more accurate policy approximations in different benchmark datasets. In addition, we show that the proposed offline RL method can learn more competitive AI agents in complex control tasks under the multiplayer online battle arena (MOBA) game, Honor of Kings.

NeurIPS Conference 2021 Conference Paper

Coordinated Proximal Policy Optimization

  • Zifan Wu
  • Chao Yu
  • Deheng Ye
  • Junge Zhang
  • Haiyin Piao
  • Hankz Hankui Zhuo

We present Coordinated Proximal Policy Optimization (CoPPO), an algorithm that extends the original Proximal Policy Optimization (PPO) to the multi-agent setting. The key idea lies in the coordinated adaptation of step size during the policy update process among multiple agents. We prove the monotonicity of policy improvement when optimizing a theoretically-grounded joint objective, and derive a simplified optimization objective based on a set of approximations. We then interpret that such an objective in CoPPO can achieve dynamic credit assignment among agents, thereby alleviating the high variance issue during the concurrent update of agent policies. Finally, we demonstrate that CoPPO outperforms several strong baselines and is competitive with the latest multi-agent PPO method (i.e., MAPPO) under typical multi-agent settings, including cooperative matrix games and the StarCraft II micromanagement tasks.

NeurIPS Conference 2021 Conference Paper

Learning Diverse Policies in MOBA Games via Macro-Goals

  • Yiming Gao
  • Bei Shi
  • Xueying Du
  • Liang Wang
  • Guangwei Chen
  • Zhenjie Lian
  • Fuhao Qiu
  • Guoan Han

Recently, many researchers have made successful progress in building AI systems for MOBA game playing with deep reinforcement learning, such as on Dota 2 and Honor of Kings. Even though these AI systems have achieved or even exceeded human-level performance, they still suffer from a lack of policy diversity. In this paper, we propose a novel Macro-Goals Guided framework, called MGG, to learn diverse policies in MOBA games. MGG abstracts strategies as macro-goals from human demonstrations and trains a Meta-Controller to predict these macro-goals. To enhance policy diversity, MGG samples macro-goals from the Meta-Controller prediction and guides the training process towards these goals. Experimental results on the typical MOBA game Honor of Kings demonstrate that MGG can execute diverse policies in different matches and lineups, and also outperforms state-of-the-art methods across 102 heroes.

IJCAI Conference 2021 Conference Paper

MapGo: Model-Assisted Policy Optimization for Goal-Oriented Tasks

  • Menghui Zhu
  • Minghuan Liu
  • Jian Shen
  • Zhicheng Zhang
  • Sheng Chen
  • Weinan Zhang
  • Deheng Ye
  • Yong Yu

In Goal-oriented Reinforcement learning, relabeling the raw goals in past experience to provide agents with hindsight ability is a major solution to the reward sparsity problem. In this paper, to enhance the diversity of relabeled goals, we develop FGI (Foresight Goal Inference), a new relabeling strategy that relabels the goals by looking into the future with a learned dynamics model. Besides, to improve sample efficiency, we propose to use the dynamics model to generate simulated trajectories for policy training. By integrating these two improvements, we introduce the MapGo framework (Model-Assisted Policy optimization for Goal-oriented tasks). In our experiments, we first show the effectiveness of the FGI strategy compared with the hindsight one, and then show that the MapGo framework achieves higher sample efficiency when compared to model-free baselines on a set of complicated tasks.
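
The foresight relabelling idea, rolling a learned dynamics model forward from a recorded state and treating the predicted future achieved goal as the relabelled goal, can be sketched as follows; the function signatures and the toy dynamics are illustrative assumptions, not MapGo's interfaces.

```python
# Sketch of foresight-style goal relabelling with a learned dynamics model.
import numpy as np

def foresight_relabel(state, policy, dynamics_model, achieved_goal, horizon=5):
    """Return a goal inferred by looking `horizon` predicted steps ahead."""
    s = state
    for _ in range(horizon):
        a = policy(s)
        s = dynamics_model(s, a)          # one-step prediction s' = f(s, a)
    return achieved_goal(s)               # goal achieved by the predicted state

# Toy stand-ins so the sketch runs: a point that drifts along its action.
relabeled = foresight_relabel(
    state=np.zeros(2),
    policy=lambda s: np.ones(2) * 0.1,
    dynamics_model=lambda s, a: s + a,
    achieved_goal=lambda s: s.copy(),
)
print(relabeled)   # goal five predicted steps ahead of the recorded state
```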

AAAI Conference 2020 Conference Paper

Mastering Complex Control in MOBA Games with Deep Reinforcement Learning

  • Deheng Ye
  • Zhao Liu
  • Mingfei Sun
  • Bei Shi
  • Peilin Zhao
  • Hao Wu
  • Hongsheng Yu
  • Shaojie Yang

We study the reinforcement learning problem of complex action control in Multi-player Online Battle Arena (MOBA) 1v1 games. This problem involves far more complicated state and action spaces than those of traditional 1v1 games, such as Go and Atari, which makes it very difficult to search for any policy with human-level performance. In this paper, we present a deep reinforcement learning framework to tackle this problem from the perspectives of both system and algorithm. Our system features low coupling and high scalability, which enables efficient exploration at large scale. Our algorithm incorporates several novel strategies, including control dependency decoupling, action masking, target attention, and dual-clip PPO, with which our proposed actor-critic network can be effectively trained in our system. Tested on the MOBA game Honor of Kings, the trained AI agents can defeat top professional human players in full 1v1 games.
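
The dual-clip PPO term mentioned above augments the standard PPO clipped objective with an extra lower bound so that a very large probability ratio cannot blow up the update when the advantage is negative; the sketch below follows that formulation, with `eps` and `c` as typical illustrative values.

```python
# Sketch of a dual-clip PPO objective: the usual PPO clipped term, plus an
# extra floor of c * advantage applied only where the advantage is negative.
import numpy as np

def dual_clip_ppo_objective(ratio, adv, eps=0.2, c=3.0):
    """ratio = pi_new / pi_old, adv = advantage estimates (arrays).
    Returns the mean objective to be maximised."""
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    standard = np.minimum(ratio * adv, clipped * adv)     # usual PPO term
    dual = np.maximum(standard, c * adv)                  # extra lower bound
    return np.where(adv < 0.0, dual, standard).mean()

obj = dual_clip_ppo_objective(np.array([0.5, 5.0, 1.1]),
                              np.array([1.0, -2.0, 0.3]))
print(obj)
```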

IJCAI Conference 2020 Conference Paper

Relation-Aware Transformer for Portfolio Policy Learning

  • Ke Xu
  • Yifan Zhang
  • Deheng Ye
  • Peilin Zhao
  • Mingkui Tan

Portfolio selection is an important yet challenging task in AI for FinTech. One of the key issues is how to represent the non-stationary price series of assets in a portfolio, which is important for portfolio decisions. The existing methods, however, fall short of capturing: 1) the complicated sequential patterns for asset price series and 2) the price correlations among multiple assets. In this paper, under a deep reinforcement learning paradigm for portfolio selection, we propose a novel Relation-aware Transformer (RAT) to handle these aspects. Specifically, being equipped with our newly developed attention modules, RAT is structurally innovated to capture both sequential patterns and asset correlations for portfolio selection. Based on the extracted sequential features, RAT is able to make profitable portfolio decisions regarding each asset via a newly devised leverage operation. Extensive experiments on real-world crypto-currency and stock datasets verify the state-of-the-art performance of RAT.

NeurIPS Conference 2020 Conference Paper

Towards Playing Full MOBA Games with Deep Reinforcement Learning

  • Deheng Ye
  • Guibin Chen
  • Wen Zhang
  • Sheng Chen
  • Bo Yuan
  • Bo Liu
  • Jia Chen
  • Zhao Liu

MOBA games, e.g., Honor of Kings, League of Legends, and Dota 2, pose grand challenges to AI systems, such as multi-agent coordination, enormous state-action spaces, and complex action control. Developing AI for playing MOBA games has accordingly attracted much attention. However, existing work falls short in handling the raw game complexity caused by the explosion of agent combinations, i.e., lineups, when expanding the hero pool; for instance, OpenAI's Dota AI limits play to a pool of only 17 heroes. As a result, full MOBA games without restrictions are far from being mastered by any existing AI system. In this paper, we propose a MOBA AI learning paradigm that methodologically enables playing full MOBA games with deep reinforcement learning. Specifically, we develop a combination of novel and existing learning techniques, including off-policy adaptation, multi-head value estimation, curriculum self-play learning, policy distillation, and Monte-Carlo tree search, to train and play a large pool of heroes while skillfully addressing the scalability issue. Tested on Honor of Kings, a popular MOBA game, we show how to build superhuman AI agents that can defeat top esports players. The superiority of our AI is demonstrated by the first large-scale performance test of a MOBA AI agent in the literature.

IJCAI Conference 2019 Conference Paper

Adversarial Transfer for Named Entity Boundary Detection with Pointer Networks

  • Jing Li
  • Deheng Ye
  • Shuo Shang

In this paper, we focus on named entity boundary detection, which aims to detect the start and end boundaries of an entity mention in text without predicting its type. A more accurate and robust detection approach is desired to alleviate error propagation in downstream applications, such as entity linking and fine-grained typing systems. We first develop a novel entity boundary labeling approach with pointer networks, where the output dictionary size depends on the input and is therefore variable. Furthermore, we propose AT-Bdry, which incorporates adversarial transfer learning into an end-to-end sequence labeling model to encourage domain-invariant representations. More importantly, AT-Bdry can reduce the difference in data distributions between the source and target domains via an unsupervised transfer learning approach (i.e., no annotated target-domain data is necessary). We conduct formal-text-to-formal-text, formal-text-to-informal-text, and ablation evaluations on five benchmark datasets. Experimental results show that AT-Bdry achieves state-of-the-art transfer performance compared with recent baselines.