Arrow Research search

Author name cluster

Changjie Fan

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

49 papers
2 author rows

Possible papers

49

AAAI Conference 2025 Conference Paper

DialogDraw: Image Generation and Editing System Based on Multi-Turn Dialogue

  • Shichao Ma
  • Xinfeng Zhang
  • Zeng Zhao
  • Bai Liu
  • Changjie Fan
  • Zhipeng Hu

In recent years, diffusion modeling has shown great potential for image generation and editing. Beyond single-model approaches, various drawing workflows now exist to handle diverse drawing tasks. However, few solutions effectively identify user intentions through dialogue and progressively complete drawings. We introduce DialogDraw, which facilitates image generation and editing through continuous dialogue interaction. DialogDraw enables users to create and refine drawings using natural language and integrates with numerous open-source drawing workflows and models. The system accurately recognizes intentions and extracts user inputs via parameterization, adapts to various drawing function parameters, and provides an intuitive interaction mode. It effectively executes user instructions, supports dozens of image generation and editing methods, and offers robust scalability. Moreover, we employ SFT and RLHF to iterate the Intention Recognition and Parameter Extraction Model (IRPEM). To evaluate DialogDraw's functionality, we propose DrawnConvos, a dataset rich in drawing functions and command dialogue data collected from the open-source community. Our evaluation demonstrates that DialogDraw excels in command compliance, identifying and adapting to user drawing intentions, thereby proving the effectiveness of our method.

IROS Conference 2025 Conference Paper

High-Precision and High-Efficiency Trajectory Tracking for Excavators Based on Closed-Loop Dynamics

  • Ziqing Zou
  • Cong Wang
  • Yue Hu
  • Xiao Liu
  • Bowen Xu
  • Rong Xiong
  • Changjie Fan
  • Yingfeng Chen

The complex nonlinear dynamics of hydraulic excavators, such as time delays and control coupling, pose significant challenges to achieving high-precision trajectory tracking. Traditional control methods often fall short in such applications due to their inability to effectively handle these nonlinearities, while commonly used learning-based methods require extensive interactions with the environment, leading to inefficiency. To address these issues, we introduce EfficientTrack, a trajectory tracking method that integrates model-based learning to manage nonlinear dynamics and leverages closed-loop dynamics to improve learning efficiency, ultimately minimizing tracking errors. We validate our method through comprehensive experiments both in simulation and on a real-world excavator. Comparative experiments in simulation demonstrate that our method outperforms existing learning-based approaches, achieving the highest tracking precision and smoothness with the fewest interactions. Real-world experiments further show that our method remains effective under load conditions and possesses the ability for continual learning, highlighting its practical applicability. For implementation details and source code, please refer to https://github.com/ZiqingZou/EfficientTrack.
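
As a sketch of the closed-loop idea above: rather than modeling the raw hydraulics, one can learn the dynamics of the inner-loop-controlled system (state plus commanded setpoint to next state) and choose commands by differentiating the tracking error through that model. This is a hypothetical minimal sketch, not the paper's implementation; the module names, dimensions, and gradient-based command search are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical sketch: learn the closed-loop dynamics (state + command -> next
# state under the inner-loop controller), then pick commands by differentiating
# the tracking error through the learned model. Dimensions are illustrative.
class ClosedLoopDynamics(nn.Module):
    def __init__(self, state_dim=6, cmd_dim=3, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + cmd_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state, cmd):
        # Residual prediction of the next state reached by the closed loop.
        return state + self.net(torch.cat([state, cmd], dim=-1))

def track_step(model, state, target, cmd_dim=3, iters=20, lr=0.1):
    """One tracking step: optimize the command so the predicted next state hits the target."""
    for p in model.parameters():          # freeze the model; only the command moves
        p.requires_grad_(False)
    cmd = torch.zeros(cmd_dim, requires_grad=True)
    opt = torch.optim.Adam([cmd], lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        loss = ((model(state, cmd) - target) ** 2).sum()
        loss.backward()
        opt.step()
    return cmd.detach()

model = ClosedLoopDynamics()
cmd = track_step(model, torch.zeros(6), torch.ones(6))
```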

AAAI Conference 2025 Conference Paper

LLM4GEN: Leveraging Semantic Representation of LLMs for Text-to-Image Generation

  • Mushui Liu
  • Yuhang Ma
  • Zhen Yang
  • Jun Dan
  • Yunlong Yu
  • Zeng Zhao
  • Zhipeng Hu
  • Bai Liu

Diffusion models have exhibited substantial success in text-to-image generation. However, they often encounter challenges when dealing with complex and dense prompts involving multiple objects, attribute binding, and long descriptions. In this paper, we propose a novel framework called LLM4GEN, which enhances the semantic understanding of text-to-image diffusion models by leveraging the representation of Large Language Models (LLMs). It can be seamlessly incorporated into various diffusion models as a plug-and-play component. A specially designed Cross-Adapter Module (CAM) integrates the original text features of text-to-image models with LLM features, thereby enhancing text-to-image generation. Additionally, to facilitate and correct entity-attribute relationships in text prompts, we develop an entity-guided regularization loss to further improve generation performance. We also introduce DensePrompts, which contains 7,000 dense prompts to provide a comprehensive evaluation for the text-to-image generation task. Experiments indicate that LLM4GEN significantly improves the semantic alignment of SD1.5 and SDXL, demonstrating increases of 9.69% and 12.90% in color on T2I-CompBench, respectively. Moreover, it surpasses existing models in terms of sample quality, image-text alignment, and human evaluation.
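
A minimal sketch of what a cross-attention adapter in the spirit of the Cross-Adapter Module could look like: the diffusion model's original text features attend over projected LLM features, with a zero-initialized gate so the adapter starts as a near-identity map and can be plugged into a pretrained pipeline. The class name, dimensions (CLIP-like 768, LLM-like 4096), and gating scheme are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

# Hypothetical cross-attention adapter: text features (query) attend over
# projected LLM features (key/value); a zero-init gate keeps the module a
# near-identity map at the start of training.
class CrossAdapter(nn.Module):
    def __init__(self, text_dim=768, llm_dim=4096, heads=8):
        super().__init__()
        self.proj = nn.Linear(llm_dim, text_dim)       # map LLM features into text space
        self.attn = nn.MultiheadAttention(text_dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))       # zero-init: no change at start

    def forward(self, text_feat, llm_feat):
        # text_feat: (B, L_t, text_dim); llm_feat: (B, L_l, llm_dim)
        kv = self.proj(llm_feat)
        fused, _ = self.attn(query=text_feat, key=kv, value=kv)
        return text_feat + torch.tanh(self.gate) * fused

adapter = CrossAdapter()
out = adapter(torch.randn(2, 77, 768), torch.randn(2, 32, 4096))
```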

ICLR Conference 2025 Conference Paper

Reinforcement Learning from Imperfect Corrective Actions and Proxy Rewards

  • Zhao-Hui Jiang
  • Xuening Feng
  • Paul Weng
  • Yifei Zhu
  • Yan Song
  • Tianze Zhou
  • Yujing Hu
  • Tangjie Lv

In practice, reinforcement learning (RL) agents are often trained with a possibly imperfect proxy reward function, which may lead to a human-agent alignment issue (i.e., the learned policy either converges to non-optimal performance with low cumulative rewards, or achieves high cumulative rewards but in an undesired manner). To tackle this issue, we consider a framework where a human labeler can provide additional feedback in the form of corrective actions, which express the labeler's action preferences, although this feedback may be imperfect as well. In this setting, to obtain a better-aligned policy guided by both learning signals, we propose a novel value-based deep RL algorithm called **I**terative learning from **Co**rrective actions and **Pro**xy rewards (ICoPro), which cycles through three phases: (1) Solicit sparse corrective actions from a human labeler on the agent's demonstrated trajectories; (2) Incorporate these corrective actions into the Q-function using a margin loss to enforce adherence to the labeler's preferences; (3) Train the agent with standard RL losses regularized with a margin loss to learn from proxy rewards and propagate the Q-values learned from human feedback. Moreover, another novel design in our approach is to integrate pseudo-labels from the target Q-network to reduce human labor and further stabilize training. We experimentally validate our proposition on a variety of tasks (Atari games and autonomous driving on highway). On the one hand, using proxy rewards with different levels of imperfection, our method can better align with humans and is more sample-efficient than baseline methods. On the other hand, facing corrective actions with different types of imperfection, our method can overcome the non-optimality of this feedback thanks to the guidance from proxy rewards.
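
Phase (2) uses a margin loss of the kind common in learning from demonstrations; a minimal sketch under that reading is below. The function name, margin value, and the exact per-action penalty are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# A minimal sketch of a large-margin loss on corrected transitions: push the
# Q-value of the labeler's corrective action above every other action by at
# least `margin`. Names and the margin value are illustrative.
def margin_loss(q_values, corrective_action, margin=0.8):
    """q_values: (B, num_actions); corrective_action: (B,) long tensor."""
    num_actions = q_values.size(1)
    # l(a, a_E) = margin for a != a_E, 0 for a == a_E
    penalty = margin * (1.0 - F.one_hot(corrective_action, num_actions).float())
    # max_a [Q(s, a) + l(a, a_E)] - Q(s, a_E), minimized toward 0
    q_corr = q_values.gather(1, corrective_action.unsqueeze(1)).squeeze(1)
    return ((q_values + penalty).max(dim=1).values - q_corr).mean()

loss = margin_loss(torch.randn(4, 6), torch.tensor([0, 2, 1, 5]))
```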

AAAI Conference 2025 Conference Paper

Storynizor: Consistent Story Generation via Inter-Frame Synchronized and Shuffled ID Injection

  • Yuhang Ma
  • Wenting Xu
  • Chaoyi Zhao
  • Keqiang Sun
  • Qinfeng Jin
  • Xiaoda Yang
  • Zeng Zhao
  • Changjie Fan

Recent advances in text-to-image diffusion models have spurred significant interest in continuous story image generation. In this paper, we introduce Storynizor, a model capable of generating coherent stories with strong inter-frame character consistency, effective foreground-background separation, and diverse pose variation. The core innovation of Storynizor lies in its key modules: ID-Synchronizer and ID-Injector. The ID-Synchronizer employs an auto-mask self-attention module and a mask perceptual loss across inter-frame images to improve the consistency of character generation, vividly representing their postures and backgrounds. The ID-Injector utilizes a Shuffling Reference Strategy (SRS) to integrate ID features into specific locations, enhancing ID-based consistent character generation. Additionally, to facilitate the training of Storynizor, we have curated a novel dataset called StoryDB comprising 100,000 images. This dataset contains single- and multiple-character sets in diverse environments, layouts, and gestures with detailed descriptions. Experimental results indicate that Storynizor demonstrates superior coherent story generation with high-fidelity character consistency, flexible postures, and vivid backgrounds compared to other character-specific methods.

AAMAS Conference 2024 Conference Paper

A Trajectory Perspective on the Role of Data Sampling Techniques in Offline Reinforcement Learning

  • Jinyi Liu
  • Yi Ma
  • Jianye Hao
  • Yujing Hu
  • Yan Zheng
  • Tangjie Lv
  • Changjie Fan

In recent years, offline reinforcement learning (RL) algorithms have gained considerable attention. However, the role of data sampling techniques in offline RL has been somewhat overlooked, despite their potential to enhance online RL performance. Recent research in offline RL indicates that applying sampling techniques directly to state-transitions does not consistently improve performance. Therefore, to better leverage limited offline trajectory data, we investigate the impact of data sampling processes on offline RL algorithms from a trajectory perspective. In this paper, we introduce a memory technique, (Prioritized) Trajectory Replay (TR/PTR), to facilitate trajectory data storage and sampling. Building on TR, we delve into the potential of trajectory backward sampling, a method that has already proven effective in online RL, in the offline RL domain. Furthermore, to improve the sampling efficiency, we examine the influence of prioritized sampling based on various trajectory priority metrics on offline training. Integrating with existing algorithms, our findings demonstrate that data sampling and updates based on vanilla TR can contribute to more stable training. Also, our proposed 13 trajectory priority metrics for PTR exhibit outstanding performance on their respective applicable types of dataset, with the best-case scenario resulting in performance improvements exceeding 25%. These performance gains are achieved at a slight extra cost during the data sampling process, highlighting the significant advantages of trajectory-based data sampling for offline RL.
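
A minimal sketch of what trajectory-level storage with prioritized, backward sampling could look like; the priority metric here (trajectory return, floored to stay positive) is just one illustrative stand-in for the paper's 13 metrics, and all names are hypothetical.

```python
import numpy as np

# Hypothetical sketch of (Prioritized) Trajectory Replay: store whole
# trajectories, sample one by priority, then walk it backward.
class PrioritizedTrajectoryReplay:
    def __init__(self):
        self.trajectories, self.priorities = [], []

    def add(self, trajectory):
        # trajectory: list of (state, action, reward, next_state, done)
        self.trajectories.append(trajectory)
        ret = sum(step[2] for step in trajectory)
        self.priorities.append(max(ret, 1e-3))  # crude floor keeps priorities positive

    def sample_backward(self, rng=np.random):
        p = np.asarray(self.priorities)
        idx = rng.choice(len(self.trajectories), p=p / p.sum())
        return list(reversed(self.trajectories[idx]))  # backward pass over the trajectory

buf = PrioritizedTrajectoryReplay()
buf.add([("s0", 0, 1.0, "s1", False), ("s1", 1, 0.5, "s2", True)])
steps = buf.sample_backward()
```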

ICLR Conference 2024 Conference Paper

AlignDiff: Aligning Diverse Human Preferences via Behavior-Customisable Diffusion Model

  • Zibin Dong
  • Yifu Yuan
  • Jianye Hao
  • Fei Ni 0001
  • Yao Mu 0001
  • Yan Zheng 0002
  • Yujing Hu
  • Tangjie Lv

Aligning agent behaviors with diverse human preferences remains a challenging problem in reinforcement learning (RL), owing to the inherent abstractness and mutability of human preferences. To address these issues, we propose AlignDiff, a novel framework that leverages RLHF to quantify human preferences, covering abstractness, and utilizes them to guide diffusion planning for zero-shot behavior customizing, covering mutability. AlignDiff can accurately match user-customized behaviors and efficiently switch from one to another. To build the framework, we first establish the multi-perspective human feedback datasets, which contain comparisons for the attributes of diverse behaviors, and then train an attribute strength model to predict quantified relative strengths. After relabeling behavioral datasets with relative strengths, we proceed to train an attribute-conditioned diffusion model, which serves as a planner with the attribute strength model as a director for preference alignment at the inference phase. We evaluate AlignDiff on various locomotion tasks and demonstrate its superior performance on preference matching, switching, and covering compared to other baselines. Its capability of completing unseen downstream tasks under human instructions also showcases the promising potential for human-AI collaboration. More visualization videos are released on https://aligndiff.github.io/.

ICML Conference 2024 Conference Paper

Bayesian Design Principles for Offline-to-Online Reinforcement Learning

  • Hao Hu 0006
  • Yiqin Yang
  • Jianing Ye
  • Chengjie Wu
  • Ziqing Mai
  • Yujing Hu
  • Tangjie Lv
  • Changjie Fan

Offline reinforcement learning (RL) is crucial for real-world applications where exploration can be costly or unsafe. However, offline learned policies are often suboptimal, and further online fine-tuning is required. In this paper, we tackle the fundamental dilemma of offline-to-online fine-tuning: if the agent remains pessimistic, it may fail to learn a better policy, while if it becomes optimistic directly, performance may suffer from a sudden drop. We show that Bayesian design principles are crucial in solving such a dilemma. Instead of adopting optimistic or pessimistic policies, the agent should act in a way that matches its belief in optimal policies. Such a probability-matching agent can avoid a sudden performance drop while still being guaranteed to find the optimal policy. Based on our theoretical findings, we introduce a novel algorithm that outperforms existing methods on various benchmarks, demonstrating the efficacy of our approach. Overall, the proposed approach provides a new perspective on offline-to-online RL that has the potential to enable more effective learning from offline data.
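
One generic way to "act according to one's belief in optimal policies" is Thompson-style probability matching over an ensemble that approximates the posterior over Q-functions: sample one member per episode and act greedily under it. The sketch below illustrates that generic idea only; it is not the paper's algorithm, and all names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical probability-matching sketch: an ensemble of Q-heads stands in
# for a posterior over Q-functions; one member is sampled per episode and the
# agent acts greedily under it.
class QEnsemble(nn.Module):
    def __init__(self, state_dim=4, num_actions=2, members=5, hidden=64):
        super().__init__()
        self.members = nn.ModuleList(
            nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                          nn.Linear(hidden, num_actions))
            for _ in range(members)
        )

    def sample_member(self):
        return self.members[torch.randint(len(self.members), (1,)).item()]

ensemble = QEnsemble()
q = ensemble.sample_member()           # resample at episode start
action = q(torch.randn(4)).argmax().item()
```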

AAAI Conference 2024 Conference Paper

EnMatch: Matchmaking for Better Player Engagement via Neural Combinatorial Optimization

  • Kai Wang
  • Haoyu Liu
  • Zhipeng Hu
  • Xiaochuan Feng
  • Minghao Zhao
  • Shiwei Zhao
  • Runze Wu
  • Xudong Shen

Matchmaking is a core task in e-sports and online games, as it contributes to player engagement and further influences the game's lifecycle. Previous methods focus on creating fair games at all times. They divide players into different tiers based on skill levels and only select players from the same tier for each game. Though this strategy can ensure fair matchmaking, it is not always good for player engagement. In this paper, we propose a novel Engagement-oriented Matchmaking (EnMatch) framework to ensure fair games and simultaneously enhance player engagement. Two main issues need to be addressed. First, it is unclear how to measure the impact of different team compositions and confrontations on player engagement during the game, considering the variety of player characteristics. Second, such a detailed consideration of every single player during matchmaking results in an NP-hard combinatorial optimization problem with non-linear objectives. In light of these challenges, we turn to real-world data analysis to reveal engagement-related factors. The resulting insights guide the development of engagement modeling, enabling the estimation of quantified engagement before a match is completed. To handle the combinatorial optimization problem, we formulate it within a reinforcement learning framework, in which a neural combinatorial optimization model is built and solved. The performance of EnMatch is finally demonstrated through comparison with other state-of-the-art methods based on several real-world datasets and online deployments in two games.

UAI Conference 2024 Conference Paper

Hybrid CtrlFormer: Learning Adaptive Search Space Partition for Hybrid Action Control via Transformer-based Monte Carlo Tree Search

  • Jiashun Liu
  • Xiaotian Hao
  • Jianye Hao
  • Yan Zheng 0002
  • Yujing Hu
  • Changjie Fan
  • Tangjie Lv
  • Zhipeng Hu

Hybrid action control tasks are common in the real world and require controlling some discrete and continuous actions simultaneously. To solve these tasks, existing Deep Reinforcement Learning (DRL) methods either directly build a separate policy for each type of action or simplify the hybrid action space into a discrete or continuous action control problem. However, these methods neglect the challenge of exploration resulting from the complexity of the hybrid action space, so more sample-efficient algorithms are needed. To this end, we propose a novel Hybrid Control Transformer (Hybrid CtrlFormer) to achieve better exploration and exploitation for hybrid action control problems. The core idea is: 1) we construct a hybrid action space tree with the discrete actions at the higher level and the continuous parameter space at the lower level. Each parameter space is split into multiple subregions. 2) To simplify the exploration space, a Transformer-based Monte-Carlo tree search method is designed to efficiently evaluate and partition the hybrid action space into good and bad subregions along the tree. Our method achieves state-of-the-art performance and sample efficiency in a variety of environments with discrete-continuous action spaces.
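
A structural sketch of the two-level hybrid action space tree described above: discrete actions at the top, each owning a continuous parameter range that can be split into subregions for further partitioning. The Transformer-based evaluation is omitted, and the class names and split rule are illustrative assumptions.

```python
from dataclasses import dataclass, field

# Illustrative structure only: discrete action ids at the top level, each with
# a continuous parameter range recursively split into subregions as the search
# judges them good or bad.
@dataclass
class ParamRegion:
    low: float
    high: float
    value: float = 0.0            # search-estimated quality of this subregion
    children: list = field(default_factory=list)

    def split(self, k=2):
        step = (self.high - self.low) / k
        self.children = [ParamRegion(self.low + i * step, self.low + (i + 1) * step)
                         for i in range(k)]

@dataclass
class HybridActionTree:
    regions: dict  # discrete action id -> root ParamRegion

tree = HybridActionTree({0: ParamRegion(-1.0, 1.0), 1: ParamRegion(0.0, 10.0)})
tree.regions[0].split(k=4)   # refine a promising discrete branch
```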

AAMAS Conference 2024 Conference Paper

Mastering Robot Control through Point-based Reinforcement Learning with Pre-training

  • Yihong Chen
  • Cong Wang
  • Tianpei Yang
  • Meng Wang
  • Yingfeng Chen
  • Jifei Zhou
  • Chaoyi Zhao
  • Xinfeng Zhang

Visual-based Reinforcement Learning (RL) has gained prominence in robotics decision-making due to its significant potential. However, the prevalent utilization of images in visual-based RL lacks explicit descriptions of object structures and spatial configurations in scenes, thereby limiting the overall efficiency and robustness of RL in robot control. Additionally, training an RL policy solely using visual observations from scratch is typically sample-inefficient, rendering it impractical for real-world application. To address these challenges, this paper proposes a novel method, called Pre-training on Point-based RL (P2RL), which takes the point cloud representations of scenes as states and preserves the intricate spatial details between objects. To further enhance efficiency, we leverage the pre-training method to bolster the perception ability of the network. Key factors in the pre-training process are systematically examined to optimize downstream RL training. Experimental results demonstrate the superior robustness and efficiency of P2RL compared to the state-of-the-art image-based RL method, especially in evaluations involving untrained scenes.

AAAI Conference 2024 Conference Paper

Optimistic Value Instructors for Cooperative Multi-Agent Reinforcement Learning

  • Chao Li
  • Yupeng Zhang
  • Jianqi Wang
  • Yujing Hu
  • Shaokang Dong
  • Wenbin Li
  • Tangjie Lv
  • Changjie Fan

In cooperative multi-agent reinforcement learning, decentralized agents hold the promise of overcoming the combinatorial explosion of the joint action space and enabling greater scalability. However, they are susceptible to a game-theoretic pathology called relative overgeneralization (RO) that shadows the optimal joint action. Although recent value-decomposition algorithms guide decentralized agents by learning a factored global action value function, the representational limitation and the inaccurate sampling of optimal joint actions during the learning process leave this problem unresolved. To address this limitation, this paper proposes a novel algorithm called Optimistic Value Instructors (OVI). The main idea behind OVI is to introduce multiple optimistic instructors into the value-decomposition paradigm, which are capable of suggesting potentially optimal joint actions and rectifying the factored global action value function to recover these optimal actions. Specifically, the instructors maintain optimistic value estimations of per-agent local actions and thus eliminate the negative effects caused by other agents' exploratory or sub-optimal non-cooperation, enabling accurate identification and suggestion of optimal joint actions. Based on the instructors' suggestions, the paper further presents two instructive constraints to rectify the factored global action value function to recover these optimal joint actions, thus overcoming the RO problem. Experimental evaluation of OVI on various cooperative multi-agent tasks demonstrates its superior performance against multiple baselines, highlighting its effectiveness.
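
The instructors' optimism can be illustrated with a hysteretic-style update: move toward the TD target at the full rate when the estimate rises and at a damped rate when it falls, so a teammate's exploratory non-cooperation does not drag down the value of a good local action. This is a generic stand-in, not the paper's exact rule; names and rates are assumptions.

```python
import numpy as np

# Generic optimistic (hysteretic-style) per-agent value update: positive TD
# errors are applied at full rate, negative ones at a damped rate.
def optimistic_update(q, state, action, target, lr=0.1, pessimism=0.01):
    delta = target - q[state, action]
    rate = lr if delta >= 0 else pessimism   # optimism: shrink the negative step
    q[state, action] += rate * delta

q = np.zeros((10, 4))
optimistic_update(q, state=3, action=1, target=1.0)
```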

IJCAI Conference 2024 Conference Paper

STAR: Spatio-Temporal State Compression for Multi-Agent Tasks with Rich Observations

  • Chao Li
  • Yujing Hu
  • Shangdong Yang
  • Tangjie Lv
  • Changjie Fan
  • Wenbin Li
  • Chongjie Zhang
  • Yang Gao

This paper focuses on the problem of learning compressed state representations for multi-agent tasks. Under the assumption of rich observations, we pinpoint that the state representations should be compressed both spatially and temporally to enable efficient prioritization of task-relevant features, which existing works typically fail to do. To overcome this limitation, we propose a novel method named Spatio-Temporal stAte compRession (STAR) that explicitly defines both spatial and temporal compression operations on the learned state representations to encode per-agent task-relevant features. Specifically, we first formalize this problem by introducing the Task Informed Partially Observable Stochastic Game (TI-POSG). Then, we identify the spatial representation compression in it as encoding the latent states from the joint observations of all agents, and achieve this by learning representations that approximate the latent states based on an information-theoretic principle. After that, we further extract the task-relevant features of each agent from these representations by aligning them based on their reward similarities, which is regarded as the temporal representation compression. Structurally, we implement these two compressions by learning a set of agent-specific decoding functions and incorporate them into a critic shared by agents for scalable learning. We evaluate our method by developing decentralized policies on 12 maps of the StarCraft Multi-Agent Challenge benchmark, and the superior performance demonstrates its effectiveness.

ICLR Conference 2024 Conference Paper

Stylized Offline Reinforcement Learning: Extracting Diverse High-Quality Behaviors from Heterogeneous Datasets

  • Yihuan Mao
  • Chengjie Wu
  • Xi Chen
  • Hao Hu 0006
  • Ji Jiang
  • Tianze Zhou
  • Tangjie Lv
  • Changjie Fan

Previous literature on policy diversity in reinforcement learning (RL) either focuses on the online setting or ignores the policy performance. In contrast, offline RL, which aims to learn high-quality policies from batched data, has yet to fully leverage the intrinsic diversity of the offline dataset. Addressing this dichotomy and aiming to balance quality and diversity poses a significant challenge to extant methodologies. This paper introduces a novel approach, termed Stylized Offline RL (SORL), which is designed to extract high-performing, stylistically diverse policies from a dataset characterized by distinct behavioral patterns. Drawing inspiration from the venerable Expectation-Maximization (EM) algorithm, SORL innovatively alternates between policy learning and trajectory clustering, a mechanism that promotes policy diversification. To further augment policy performance, we introduce advantage-weighted style learning into the SORL framework. Experimental evaluations across multiple environments demonstrate the significant superiority of SORL over previous methods in extracting high-quality policies with diverse behaviors. A case in point is that SORL successfully learns strong policies with markedly distinct playing patterns from a real-world human dataset of a popular basketball video game "Dunk City Dynasty."
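
The EM-style alternation can be sketched on toy data: each "style" is a Gaussian over actions, the E-step assigns every trajectory to the style under which its actions are most likely, and the M-step refits each style on its cluster (standing in for advantage-weighted policy learning). Everything below is an illustrative toy, not SORL's implementation.

```python
import numpy as np

# Toy EM alternation between trajectory clustering (E-step) and per-style
# policy fitting (M-step). Styles are 1-D Gaussians over actions.
rng = np.random.default_rng(0)
trajs = [rng.normal(mu, 0.3, size=20) for mu in [-1.0, 0.0, 1.0] for _ in range(10)]

K = 3
means, stds = rng.normal(size=K), np.ones(K)
for _ in range(10):
    # E-step: per-trajectory log-likelihood under each style, pick the best.
    ll = np.array([[-0.5 * np.sum(((t - means[k]) / stds[k]) ** 2)
                    - len(t) * np.log(stds[k]) for k in range(K)] for t in trajs])
    assign = ll.argmax(axis=1)
    # M-step: refit each style on its assigned trajectories.
    for k in range(K):
        cluster = np.concatenate([t for t, a in zip(trajs, assign) if a == k] or trajs)
        means[k], stds[k] = cluster.mean(), max(cluster.std(), 1e-2)
```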

IJCAI Conference 2024 Conference Paper

vMFER: Von Mises-Fisher Experience Resampling Based on Uncertainty of Gradient Directions for Policy Improvement

  • Yiwen Zhu
  • Jinyi Liu
  • Wenya Wei
  • Qianyi Fu
  • Yujing Hu
  • Zhou Fang
  • Bo An
  • Jianye Hao

Reinforcement Learning (RL) is a widely employed technique in decision-making problems, encompassing two fundamental operations -- policy evaluation and policy improvement. Enhancing learning efficiency remains a key challenge in RL, with many efforts focused on using ensemble critics to boost policy evaluation efficiency. However, when using multiple critics, the actor in the policy improvement process can obtain different gradients. Previous studies have combined these gradients without considering their disagreements. Therefore, optimizing the policy improvement process is crucial to enhance learning efficiency. This study focuses on investigating the impact of gradient disagreements caused by ensemble critics on policy improvement. We introduce the concept of uncertainty of gradient directions as a means to measure the disagreement among gradients utilized in the policy improvement process. Through measuring the disagreement among gradients, we find that transitions with lower uncertainty of gradient directions are more reliable in the policy improvement process. Building on this analysis, we propose a method called von Mises-Fisher Experience Resampling (vMFER), which optimizes the policy improvement process by resampling transitions and assigning higher confidence to transitions with lower uncertainty of gradient directions. Our experiments demonstrate that vMFER significantly outperforms the benchmark and is particularly well-suited for ensemble structures in RL.
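
The core quantity can be sketched directly: treat each critic's per-transition policy gradient as a direction on the unit sphere; the length of their mean direction (the resultant length, which governs the concentration of a fitted von Mises-Fisher distribution) is high when critics agree and low when they disagree, and resampling weights can follow from it. Function names and the weighting scheme are illustrative assumptions.

```python
import numpy as np

# Per-transition agreement among critics' policy gradients via the resultant
# length of their normalized directions; transitions with higher agreement
# receive higher resampling probability.
def gradient_agreement(grads):
    """grads: (num_critics, dim) per-transition policy gradients."""
    units = grads / (np.linalg.norm(grads, axis=1, keepdims=True) + 1e-8)
    return np.linalg.norm(units.mean(axis=0))   # resultant length R in [0, 1]

def resample_probs(per_transition_grads):
    r = np.array([gradient_agreement(g) for g in per_transition_grads])
    return r / r.sum()   # favor transitions whose gradient directions agree

probs = resample_probs([np.random.randn(5, 16) for _ in range(32)])
```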

AAMAS Conference 2024 Conference Paper

vMFER: von Mises-Fisher Experience Resampling Based on Uncertainty of Gradient Directions for Policy Improvement of Actor-Critic Algorithms

  • Yiwen Zhu
  • Jinyi Liu
  • Wenya Wei
  • Qianyi Fu
  • Yujing Hu
  • Zhou Fang
  • Bo An
  • Jianye Hao

Reinforcement Learning (RL) is a widely employed technique in decision-making problems, encompassing two fundamental operations – policy evaluation and policy improvement. Actor-critic algorithms dominate the field of RL, but there is a challenge in improving their learning efficiency. To address this, ensemble critics are often employed to enhance policy evaluation efficiency. However, when using multiple critics, the actor in the policy improvement process can obtain different gradients. Previous studies have combined these gradients without considering their disagreements. Therefore, optimizing the policy improvement process is crucial to enhance the learning efficiency of actor-critic algorithms. This study focuses on investigating the impact of gradient disagreements caused by ensemble critics on policy improvement. We introduce the concept of uncertainty of gradient directions as a means to measure the disagreement among gradients utilized in the policy improvement process. Through measuring the disagreement among gradients, we find that transitions with lower uncertainty of gradient directions are more reliable in the policy improvement process. Building on this analysis, we propose a method called von Mises-Fisher Experience Resampling (vMFER), which optimizes the policy improvement process by resampling transitions and assigning higher confidence to transitions with lower uncertainty of gradient directions. Our experiments on MuJoCo robotic control tasks and robotic arm tasks with sparse rewards demonstrate that vMFER significantly outperforms the benchmark and is particularly well-suited for ensemble structures in RL.

AAMAS Conference 2023 Conference Paper

Adaptive Value Decomposition with Greedy Marginal Contribution Computation for Cooperative Multi-Agent Reinforcement Learning

  • Shanqi Liu
  • Yujing Hu
  • Runze Wu
  • Dong Xing
  • Yu Xiong
  • Changjie Fan
  • Kun Kuang
  • Yong Liu

Real-world cooperation often requires intensive, simultaneous coordination among agents. This task has been extensively studied within the framework of cooperative multi-agent reinforcement learning (MARL), and value decomposition methods are among those cutting-edge solutions. However, traditional methods that learn the value function as a monotonic mixing of per-agent utilities cannot solve tasks with non-monotonic returns. This hinders their application in generic scenarios. Recent methods tackle this problem from the perspective of implicit credit assignment by learning value functions with complete expressiveness or using additional structures to improve cooperation. However, they are either difficult to learn due to large joint action spaces or insufficient to capture the complicated interactions among agents which are essential to solving tasks with non-monotonic returns. Moreover, applications in real-world scenarios usually require policies to be interpretable, but interpretability is limited in the implicit credit assignment methods. To address these problems, we propose a novel explicit credit assignment method for the non-monotonic problem. Our method, Adaptive Value decomposition with Greedy Marginal contribution (AVGM), is based on an adaptive value decomposition that learns the cooperative value of a group of dynamically changing agents. We first illustrate that the proposed value decomposition can consider the complicated interactions among agents and is feasible to learn in large-scale scenarios. Then, our method uses a greedy marginal contribution computed from the value decomposition as an individual credit to incentivize agents to learn the optimal cooperative policy. We further extend the module with an action encoder to guarantee the linear time complexity for computing the greedy marginal contribution. Experimental results demonstrate that our method achieves significant performance improvements in several non-monotonic domains. Besides, we showcase that our model maintains a good sense of interpretability and rationality, suggesting it can be applied to scenarios with more realistic demands.
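
A greedy marginal contribution can be sketched as follows: grow the group one agent at a time, always adding the agent whose inclusion raises the group value most, and take that increase as the agent's credit. The group_value callable below is a toy stand-in for the learned adaptive value decomposition.

```python
import numpy as np

# Greedy marginal-contribution credits: each agent's credit is the value gain
# its inclusion brings to the greedily grown group.
def greedy_marginal_credits(agents, group_value):
    group, credits = [], {}
    while len(group) < len(agents):
        remaining = [a for a in agents if a not in group]
        gains = [group_value(group + [a]) - group_value(group) for a in remaining]
        best = int(np.argmax(gains))
        credits[remaining[best]] = gains[best]
        group.append(remaining[best])
    return credits

# Toy value function: diminishing returns in group size.
credits = greedy_marginal_credits([0, 1, 2], lambda g: np.sqrt(len(g)))
```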

NeurIPS Conference 2023 Conference Paper

Conservative Offline Policy Adaptation in Multi-Agent Games

  • Chengjie Wu
  • Pingzhong Tang
  • Jun Yang
  • Yujing Hu
  • Tangjie Lv
  • Changjie Fan
  • Chongjie Zhang

Prior research on policy adaptation in multi-agent games has often relied on online interaction with the target agent in training, which can be expensive and impractical in real-world scenarios. Inspired by recent progress in offline reinforcement learning, this paper studies offline policy adaptation, which aims to utilize the target agent's behavior data to exploit its weakness or enable effective cooperation. We investigate its distinct challenges of distributional shift and risk-free deviation, and propose a novel learning objective, conservative offline adaptation, that optimizes the worst-case performance against any dataset-consistent proxy models. We propose an efficient algorithm called Constrained Self-Play (CSP) that incorporates dataset information into regularized policy learning. We prove that CSP learns a near-optimal risk-free offline adaptation policy upon convergence. Empirical results demonstrate that CSP outperforms non-conservative baselines in various environments, including Maze, predator-prey, MuJoCo, and Google Football.

AAAI Conference 2023 Conference Paper

DINet: Deformation Inpainting Network for Realistic Face Visually Dubbing on High Resolution Video

  • Zhimeng Zhang
  • Zhipeng Hu
  • Wenjin Deng
  • Changjie Fan
  • Tangjie Lv
  • Yu Ding

For few-shot learning, it is still a critical challenge to realize photo-realistic face visually dubbing on high-resolution videos. Previous works fail to generate high-fidelity dubbing results. To address the above problem, this paper proposes a Deformation Inpainting Network (DINet) for high-resolution face visually dubbing. Different from previous works relying on multiple up-sample layers to directly generate pixels from latent embeddings, DINet performs spatial deformation on feature maps of reference images to better preserve high-frequency textural details. Specifically, DINet consists of one deformation part and one inpainting part. In the first part, five reference facial images adaptively perform spatial deformation to create deformed feature maps encoding mouth shapes at each frame, in order to align with input driving audio and also the head poses of input source images. In the second part, to produce face visually dubbing, a feature decoder is responsible for adaptively incorporating mouth movements from the deformed feature maps and other attributes (i.e., head pose and upper facial expression) from the source feature maps together. Finally, DINet achieves face visually dubbing with rich textural details. We conduct qualitative and quantitative comparisons to validate our DINet on high-resolution videos. The experimental results show that our method outperforms state-of-the-art works.

ICLR Conference 2023 Conference Paper

EUCLID: Towards Efficient Unsupervised Reinforcement Learning with Multi-choice Dynamics Model

  • Yifu Yuan
  • Jianye Hao
  • Fei Ni 0001
  • Yao Mu 0001
  • Yan Zheng 0002
  • Yujing Hu
  • Jinyi Liu 0002
  • Yingfeng Chen

Unsupervised reinforcement learning (URL) poses a promising paradigm to learn useful behaviors in a task-agnostic environment without the guidance of extrinsic rewards, facilitating fast adaptation to various downstream tasks. Previous works focused on pre-training in a model-free manner while lacking the study of transition dynamics modeling, leaving large room for improving sample efficiency in downstream tasks. To this end, we propose an Efficient Unsupervised Reinforcement Learning Framework with a Multi-choice Dynamics model (EUCLID), which introduces a novel model-fused paradigm to jointly pre-train the dynamics model and the unsupervised exploration policy in the pre-training phase, thus better leveraging the environmental samples and improving downstream task sampling efficiency. However, constructing a generalizable model which captures the local dynamics under different behaviors remains a challenging problem. We introduce the multi-choice dynamics model that covers different local dynamics under different behaviors concurrently, which uses different heads to learn the state transitions under different behaviors during unsupervised pre-training and selects the most appropriate head for prediction in the downstream task. Experimental results in the manipulation and locomotion domains demonstrate that EUCLID achieves state-of-the-art performance with high sample efficiency, basically solving the state-based URLB benchmark and reaching a mean normalized score of 104.0±1.2% in downstream tasks with 100k fine-tuning steps, which is equivalent to DDPG's performance at 2M interactive steps with 20× more data. More visualization videos are released on our homepage.
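
A minimal sketch of a multi-choice dynamics model under the description above: a shared trunk with several prediction heads, and head selection by lowest prediction error on downstream data. Shapes, names, and the selection rule's details are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical multi-choice dynamics model: shared trunk, multiple heads, each
# specializing in the transition dynamics under a different behavior; the head
# with the lowest prediction error on downstream transitions is selected.
class MultiChoiceDynamics(nn.Module):
    def __init__(self, state_dim=8, act_dim=2, heads=4, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim + act_dim, hidden), nn.ReLU())
        self.heads = nn.ModuleList(nn.Linear(hidden, state_dim) for _ in range(heads))

    def forward(self, s, a, head):
        return self.heads[head](self.trunk(torch.cat([s, a], dim=-1)))

    @torch.no_grad()
    def select_head(self, s, a, s_next):
        errs = [((self(s, a, h) - s_next) ** 2).mean() for h in range(len(self.heads))]
        return int(torch.stack(errs).argmin())

model = MultiChoiceDynamics()
best = model.select_head(torch.randn(64, 8), torch.randn(64, 2), torch.randn(64, 8))
```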

AAAI Conference 2023 Conference Paper

FlowFace: Semantic Flow-Guided Shape-Aware Face Swapping

  • Hao Zeng
  • Wei Zhang
  • Changjie Fan
  • Tangjie Lv
  • Suzhen Wang
  • Zhimeng Zhang
  • Bowen Ma
  • Lincheng Li

In this work, we propose a semantic flow-guided two-stage framework for shape-aware face swapping, namely FlowFace. Unlike most previous methods that focus on transferring the source inner facial features but neglect facial contours, our FlowFace can transfer both of them to a target face, thus leading to more realistic face swapping. Concretely, our FlowFace consists of a face reshaping network and a face swapping network. The face reshaping network addresses the shape outline differences between the source and target faces. It first estimates a semantic flow (i.e. face shape differences) between the source and the target face, and then explicitly warps the target face shape with the estimated semantic flow. After reshaping, the face swapping network generates inner facial features that exhibit the identity of the source face. We employ a pre-trained face masked autoencoder (MAE) to extract facial features from both the source face and the target face. In contrast to previous methods that use identity embedding to preserve identity information, the features extracted by our encoder can better capture facial appearances and identity information. Then, we develop a cross-attention fusion module to adaptively fuse inner facial features from the source face with the target facial attributes, thus leading to better identity preservation. Extensive quantitative and qualitative experiments on in-the-wild faces demonstrate that our FlowFace outperforms the state-of-the-art significantly.

AAMAS Conference 2023 Conference Paper

Off-Beat Multi-Agent Reinforcement Learning

  • Wei Qiu
  • Weixun Wang
  • Rundong Wang
  • Bo An
  • Yujing Hu
  • Svetlana Obraztsova
  • Zinovi Rabinovich
  • Jianye Hao

We investigate cooperative multi-agent reinforcement learning in environments with off-beat actions, i.e., all actions have execution durations, and during these durations the environmental changes are not synchronised with action executions. To learn efficient multi-agent coordination in environments with off-beat actions, we propose a novel reward redistribution method built on our novel graph-based episodic memory, which we name LeGEM. Empirical results on the stag-hunter game show that it significantly boosts multi-agent coordination.

AAAI Conference 2023 Conference Paper

StyleTalk: One-Shot Talking Head Generation with Controllable Speaking Styles

  • Yifeng Ma
  • Suzhen Wang
  • Zhipeng Hu
  • Changjie Fan
  • Tangjie Lv
  • Yu Ding
  • Zhidong Deng
  • Xin Yu

Different people speak with diverse personalized speaking styles. Although existing one-shot talking head methods have made significant progress in lip sync, natural facial expressions, and stable head motions, they still cannot generate diverse speaking styles in the final talking head videos. To tackle this problem, we propose a one-shot style-controllable talking face generation framework. In a nutshell, we aim to attain a speaking style from an arbitrary reference speaking video and then drive the one-shot portrait to speak with the reference speaking style and another piece of audio. Specifically, we first develop a style encoder to extract dynamic facial motion patterns of a style reference video and then encode them into a style code. Afterward, we introduce a style-controllable decoder to synthesize stylized facial animations from the speech content and style code. In order to integrate the reference speaking style into generated videos, we design a style-aware adaptive transformer, which enables the encoded style code to adjust the weights of the feed-forward layers accordingly. Thanks to the style-aware adaptation mechanism, the reference speaking style can be better embedded into synthesized videos during decoding. Extensive experiments demonstrate that our method is capable of generating talking head videos with diverse speaking styles from only one portrait image and an audio clip while achieving authentic visual effects. Project Page: https://github.com/FuxiVirtualHuman/styletalk.

ICML Conference 2022 Conference Paper

Deconfounded Value Decomposition for Multi-Agent Reinforcement Learning

  • Jiahui Li 0003
  • Kun Kuang 0001
  • Baoxiang Wang 0001
  • Furui Liu
  • Long Chen 0016
  • Changjie Fan
  • Fei Wu 0001
  • Jun Xiao 0001

Value decomposition (VD) methods have been widely used in cooperative multi-agent reinforcement learning (MARL), where credit assignment plays an important role in guiding the agents’ decentralized execution. In this paper, we investigate VD from a novel perspective of causal inference. We first show that the environment in existing VD methods is an unobserved confounder as the common cause factor of the global state and the joint value function, which leads to the confounding bias on learning credit assignment. We then present our approach, deconfounded value decomposition (DVD), which cuts off the backdoor confounding path from the global state to the joint value function. The cut is implemented by introducing the trajectory graph, which depends only on the local trajectories, as a proxy confounder. DVD is general enough to be applied to various VD methods, and extensive experiments show that DVD can consistently achieve significant performance gains over different state-of-the-art VD methods on StarCraft II and MACO benchmarks.

ICML Conference 2022 Conference Paper

Individual Reward Assisted Multi-Agent Reinforcement Learning

  • Li Wang
  • Yupeng Zhang
  • Yujing Hu
  • Weixun Wang
  • Chongjie Zhang
  • Yang Gao 0001
  • Jianye Hao
  • Tangjie Lv

In many real-world multi-agent systems, the sparsity of team rewards often makes it difficult for an algorithm to successfully learn a cooperative team policy. At present, the common way for solving this problem is to design some dense individual rewards for the agents to guide the cooperation. However, most existing works utilize individual rewards in ways that do not always promote teamwork and sometimes are even counterproductive. In this paper, we propose Individual Reward Assisted Team Policy Learning (IRAT), which learns two policies for each agent from the dense individual reward and the sparse team reward with discrepancy constraints for updating the two policies mutually. Experimental results in different scenarios, such as the Multi-Agent Particle Environment and the Google Research Football Environment, show that IRAT significantly outperforms the baseline methods and can greatly promote team policy learning without deviating from the original team objective, even when the individual rewards are misleading or conflict with the team rewards.
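
The discrepancy constraint can be illustrated with a KL penalty between the individual-reward policy and the team-reward policy, so individual-reward shaping cannot pull the team policy off its objective. This is a hedged sketch of the general mechanism, not IRAT's exact objective; the distributions and coefficient are assumptions.

```python
import torch
import torch.distributions as D

# Generic discrepancy penalty between the two per-agent policies: each policy
# trains on its own reward plus a KL term that keeps the pair close.
def discrepancy_penalty(logits_individual, logits_team):
    p_ind = D.Categorical(logits=logits_individual)
    p_team = D.Categorical(logits=logits_team)
    return D.kl_divergence(p_ind, p_team).mean()

pen = discrepancy_penalty(torch.randn(32, 5), torch.randn(32, 5))
# total_loss = team_pg_loss + beta * pen   (beta: a tunable coefficient)
```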

AAAI Conference 2022 Conference Paper

Multi-Dimensional Prediction of Guild Health in Online Games: A Stability-Aware Multi-Task Learning Approach

  • Chuang Zhao
  • Hongke Zhao
  • Runze Wu
  • Qilin Deng
  • Yu Ding
  • Jianrong Tao
  • Changjie Fan

Guilds are the most important long-term virtual communities and emotional bonds in massively multiplayer online role-playing games (MMORPGs). How guilds fare, e.g., whether they stay healthy or not, matters greatly to player retention and the game ecology. The main challenge is to characterize and predict guild health in a quantitative, dynamic, and multi-dimensional manner based on complicated multimedia data streams. To this end, we propose a novel framework, namely the Stability-Aware Multi-task Learning Approach (SAMLA), to address these challenges. Specifically, different media-specific modules are designed to extract information from multiple media types of tabular data, time series characteristics, and heterogeneous graphs. To capture the dynamics of guild health, we introduce a representation encoder to provide a time-series view of multi-media data that is used for task prediction. Inspired by well-received theories on organization management, we delicately define five specific and quantitative dimensions of guild health and make parallel predictions based on a multi-task approach. Besides, we devise a novel auxiliary task, i.e., guild stability, to boost the performance of the guild health prediction task. Extensive experiments on a real-world large-scale MMORPG dataset verify that our proposed method outperforms the state-of-the-art methods in the task of organizational health characterization and prediction. Moreover, our work has been practically deployed in an online MMORPG, and case studies clearly illustrate its significant value.

NeurIPS Conference 2021 Conference Paper

An Efficient Transfer Learning Framework for Multiagent Reinforcement Learning

  • Tianpei Yang
  • Weixun Wang
  • Hongyao Tang
  • Jianye Hao
  • Zhaopeng Meng
  • Hangyu Mao
  • Dong Li
  • Wulong Liu

Transfer Learning has shown great potential to enhance single-agent Reinforcement Learning (RL) efficiency. Similarly, Multiagent RL (MARL) can also be accelerated if agents can share knowledge with each other. However, it remains an open problem how an agent should learn from other agents. In this paper, we propose a novel Multiagent Policy Transfer Framework (MAPTF) to improve MARL efficiency. MAPTF learns which agent's policy is the best to reuse for each agent and when to terminate it by modeling multiagent policy transfer as the option learning problem. Furthermore, in practice, the option module can only collect all agents' local experiences for updates due to the partial observability of the environment. In this setting, the agents' experiences may be inconsistent with one another, which may cause inaccuracy and oscillation in the option-value estimation. Therefore, we propose a novel option learning algorithm, successor representation option learning, to solve this issue by decoupling the environment dynamics from rewards and learning the option-value under each agent's preference. MAPTF can be easily combined with existing deep RL and MARL approaches, and experimental results show it significantly boosts the performance of existing methods in both discrete and continuous state spaces.

IJCAI Conference 2021 Conference Paper

Audio2Head: Audio-driven One-shot Talking-head Generation with Natural Head Motion

  • Suzhen Wang
  • Lincheng Li
  • Yu Ding
  • Changjie Fan
  • Xin Yu

We propose an audio-driven talking-head method to generate photo-realistic talking-head videos from a single reference image. In this work, we tackle two key challenges: (i) producing natural head motions that match speech prosody, and (ii) maintaining the appearance of a speaker in a large head motion while stabilizing the non-face regions. We first design a head pose predictor by modeling rigid 6D head movements with a motion-aware recurrent neural network (RNN). In this way, the predicted head poses act as the low-frequency holistic movements of a talking head, thus allowing our latter network to focus on detailed facial movement generation. To depict the entire image motions arising from audio, we exploit a keypoint based dense motion field representation. Then, we develop a motion field generator to produce the dense motion fields from input audio, head poses, and a reference image. As this keypoint based representation models the motions of facial regions, head, and backgrounds integrally, our method can better constrain the spatial and temporal consistency of the generated videos. Finally, an image generation network is employed to render photo-realistic talking-head videos from the estimated keypoint based motion fields and the input reference image. Extensive experiments demonstrate that our method produces videos with plausible head motions, synchronized facial expressions, and stable backgrounds and outperforms the state-of-the-art.

IJCAI Conference 2021 Conference Paper

Automatic Translation of Music-to-Dance for In-Game Characters

  • Yinglin Duan
  • Tianyang Shi
  • Zhipeng Hu
  • Zhengxia Zou
  • Changjie Fan
  • Yi Yuan
  • Xi Li

Music-to-dance translation is an emerging and powerful feature in recent role-playing games. Previous works on this topic consider music-to-dance as a supervised motion generation problem based on time-series data. However, these methods require a large number of training data pairs and may suffer from the degradation of movements. This paper provides a new solution to this task where we re-formulate the translation as a piece-wise dance phrase retrieval problem based on choreography theory. With such a design, players are allowed to optionally edit the dance movements on top of our generation, whereas other regression-based methods ignore such user interactivity. Considering that dance motion capture is expensive and requires the assistance of professional dancers, we train our method in a semi-supervised learning fashion with a large unlabeled music dataset (20× the size of our labeled one) and also introduce self-supervised pre-training to improve the training stability and generalization performance. Experimental results suggest that our method not only generalizes well over various styles of music but also succeeds in choreography for game players. Our project, including the large-scale dataset and supplemental materials, is available at https://github.com/FuxiCV/music-to-dance.

NeurIPS Conference 2021 Conference Paper

Episodic Multi-agent Reinforcement Learning with Curiosity-driven Exploration

  • Lulu Zheng
  • Jiarui Chen
  • Jianhao Wang
  • Jiamin He
  • Yujing Hu
  • Yingfeng Chen
  • Changjie Fan
  • Yang Gao

Efficient exploration in deep cooperative multi-agent reinforcement learning (MARL) still remains challenging in complex coordination problems. In this paper, we introduce a novel method, Episodic Multi-agent reinforcement learning with Curiosity-driven exploration (EMC). We leverage an insight of popular factorized MARL algorithms that the "induced" individual Q-values, i.e., the individual utility functions used for local execution, are the embeddings of local action-observation histories, and can capture the interaction between agents due to reward backpropagation during centralized training. Therefore, we use prediction errors of individual Q-values as intrinsic rewards for coordinated exploration and utilize episodic memory to exploit explored informative experience to boost policy training. As the dynamics of an agent's individual Q-value function captures the novelty of states and the influence from other agents, our intrinsic reward can induce coordinated exploration toward new or promising states. We illustrate the advantages of our method with didactic examples, and demonstrate that it significantly outperforms state-of-the-art MARL baselines on challenging tasks in the StarCraft II micromanagement benchmark.
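
The curiosity signal can be sketched as follows: a predictor regresses each agent's individual Q-values from its observation, and the prediction error serves as the intrinsic reward, being large in novel or influence-heavy situations. Shapes and names are illustrative; the episodic memory component is omitted.

```python
import torch
import torch.nn as nn

# Hypothetical curiosity sketch: prediction error of an agent's individual
# Q-values acts as an intrinsic exploration bonus.
class QPredictor(nn.Module):
    def __init__(self, obs_dim=16, num_actions=5, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, num_actions))

    def intrinsic_reward(self, obs, individual_q):
        # Larger error -> more novel -> larger exploration bonus.
        return ((self.net(obs) - individual_q.detach()) ** 2).mean(dim=-1)

pred = QPredictor()
bonus = pred.intrinsic_reward(torch.randn(32, 16), torch.randn(32, 5))
```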

ICML Conference 2021 Conference Paper

MetaCURE: Meta Reinforcement Learning with Empowerment-Driven Exploration

  • Jin Zhang 0016
  • Jianhao Wang
  • Hao Hu 0006
  • Tong Chen
  • Yingfeng Chen
  • Changjie Fan
  • Chongjie Zhang

Meta reinforcement learning (meta-RL) extracts knowledge from previous tasks and achieves fast adaptation to new tasks. Despite recent progress, efficient exploration in meta-RL remains a key challenge in sparse-reward tasks, as it requires quickly finding informative task-relevant experiences in both meta-training and adaptation. To address this challenge, we explicitly model an exploration policy learning problem for meta-RL, which is separated from exploitation policy learning, and introduce a novel empowerment-driven exploration objective, which aims to maximize information gain for task identification. We derive a corresponding intrinsic reward and develop a new off-policy meta-RL framework, which efficiently learns separate context-aware exploration and exploitation policies by sharing the knowledge of task inference. Experimental evaluation shows that our meta-RL method significantly outperforms state-of-the-art baselines on various sparse-reward MuJoCo locomotion tasks and more complex sparse-reward Meta-World tasks.

AAAI Conference 2021 Conference Paper

Reinforcement Learning with a Disentangled Universal Value Function for Item Recommendation

  • Kai Wang
  • Zhene Zou
  • Qilin Deng
  • Jianrong Tao
  • Runze Wu
  • Changjie Fan
  • Liang Chen
  • Peng Cui

In recent years, there has been great interest, as well as great challenges, in applying reinforcement learning (RL) to recommendation systems (RS). In this paper, we summarize three key practical challenges of large-scale RL-based recommender systems: massive state and action spaces, high-variance environments, and the unspecific reward setting in recommendation. All these problems remain largely unexplored in the existing literature and make the application of RL challenging. We develop a model-based reinforcement learning framework, called GoalRec. Inspired by the ideas of the world model (model-based), value function estimation (model-free), and goal-based RL, a novel disentangled universal value function designed for item recommendation is proposed. It can generalize to various goals that the recommender may have, and disentangle the stochastic environmental dynamics and high-variance reward signals accordingly. As a part of the value function, free from the sparse and high-variance reward signals, a high-capacity reward-independent world model is trained to simulate complex environmental dynamics under a certain goal. Based on the predicted environmental dynamics, the disentangled universal value function is related to the user's future trajectory instead of a monolithic state and a scalar reward. We demonstrate the superiority of GoalRec over previous approaches in terms of the above three practical challenges in a series of simulations and a real application.

NeurIPS Conference 2021 Conference Paper

Towards Unifying Behavioral and Response Diversity for Open-ended Learning in Zero-sum Games

  • Xiangyu Liu
  • Hangtian Jia
  • Ying Wen
  • Yujing Hu
  • Yingfeng Chen
  • Changjie Fan
  • Zhipeng Hu
  • Yaodong Yang

Measuring and promoting policy diversity is critical for solving games with strong non-transitive dynamics where strategic cycles exist and there is no consistent winner (e.g., Rock-Paper-Scissors). With that in mind, maintaining a pool of diverse policies via open-ended learning is an attractive solution, which can generate auto-curricula to avoid being exploited. However, in conventional open-ended learning algorithms, there are no widely accepted definitions for diversity, making it hard to construct and evaluate diverse policies. In this work, we summarize previous concepts of diversity and work towards offering a unified measure of diversity in multi-agent open-ended learning that includes all elements in Markov games, based on both Behavioral Diversity (BD) and Response Diversity (RD). At the trajectory distribution level, we re-define BD in the state-action space as the discrepancies of occupancy measures. For the reward dynamics, we propose RD to characterize diversity through the responses of policies when encountering different opponents. We also show that many current diversity measures fall into one of the categories of BD or RD but not both. With this unified diversity measure, we design the corresponding diversity-promoting objective and population effectivity when seeking the best responses in open-ended learning. We validate our methods in relatively simple games like the matrix game and the non-transitive mixture model, as well as in the complex Google Research Football environment. The population found by our methods reveals the lowest exploitability and highest population effectivity in the matrix game and non-transitive mixture model, as well as the largest goal difference when interacting with opponents of various levels in Google Research Football.

AAAI Conference 2021 Conference Paper

Write-a-speaker: Text-based Emotional and Rhythmic Talking-head Generation

  • Lincheng Li
  • Suzhen Wang
  • Zhimeng Zhang
  • Yu Ding
  • Yixing Zheng
  • Xin Yu
  • Changjie Fan

In this paper, we propose a novel text-based talking-head video generation framework that synthesizes high-fidelity facial expressions and head motions in accordance with contextual sentiments as well as speech rhythm and pauses. To be specific, our framework consists of a speaker-independent stage and a speaker-specific stage. In the speaker-independent stage, we design three parallel networks to generate animation parameters of the mouth, upper face, and head from texts separately. In the speaker-specific stage, we present a 3D face model guided attention network to synthesize videos tailored for different individuals. It takes the animation parameters as input and exploits an attention mask to manipulate facial expression changes for the input individuals. Furthermore, to better establish authentic correspondences between visual motions (i.e., facial expression changes and head movements) and audios, we leverage a high-accuracy motion capture dataset instead of relying on long videos of specific individuals. After attaining the visual and audio correspondences, we can effectively train our network in an end-to-end fashion. Extensive qualitative and quantitative experiments demonstrate that our algorithm achieves high-quality photorealistic talking-head videos, including various facial expressions and head motions according to speech rhythms, and outperforms the state-of-the-art.

ICLR Conference 2020 Conference Paper

Action Semantics Network: Considering the Effects of Actions in Multiagent Systems

  • Weixun Wang
  • Tianpei Yang
  • Yong Liu 0007
  • Jianye Hao
  • Xiaotian Hao
  • Yujing Hu
  • Yingfeng Chen
  • Changjie Fan

In multiagent systems (MASs), each agent makes individual decisions, but all of them jointly drive the system's evolution. Learning in MASs is difficult since each agent's selection of actions must take place in the presence of other co-learning agents. Moreover, environmental stochasticity and uncertainty increase exponentially with the number of agents. Previous works borrow various multiagent coordination mechanisms into deep learning architectures to facilitate coordination. However, none of them explicitly considers action semantics between agents, i.e., that different actions have different influences on other agents. In this paper, we propose a novel network architecture, named Action Semantics Network (ASN), that explicitly represents such action semantics between agents. ASN characterizes the influence of different actions on other agents using neural networks based on the action semantics between them. ASN can be easily combined with existing deep reinforcement learning (DRL) algorithms to boost their performance. Experimental results on StarCraft II micromanagement and Neural MMO show that ASN significantly improves the performance of state-of-the-art DRL approaches compared with several baseline network architectures.
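
A hypothetical sketch of the ASN layout follows; the module names and sizes are assumptions rather than the paper's code. Q-values for environment-affecting actions come from the agent's own encoding, while the Q-value for an action directed at agent j also conditions on j's features:

```python
# Hypothetical ASN-style layout: per-opponent heads for agent-directed actions.
import torch
import torch.nn as nn

class ActionSemanticsNet(nn.Module):
    def __init__(self, obs_dim, other_dim, n_self_actions, n_others, hidden=64):
        super().__init__()
        self.self_enc = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.self_head = nn.Linear(hidden, n_self_actions)
        # One pairwise module per other agent, scoring the action aimed at it.
        self.pair_heads = nn.ModuleList(
            nn.Sequential(nn.Linear(obs_dim + other_dim, hidden), nn.ReLU(),
                          nn.Linear(hidden, 1))
            for _ in range(n_others)
        )

    def forward(self, obs, others):  # others: (batch, n_others, other_dim)
        h = self.self_enc(obs)
        q_self = self.self_head(h)                       # env-affecting actions
        q_pair = torch.cat([head(torch.cat([obs, others[:, j]], dim=-1))
                            for j, head in enumerate(self.pair_heads)], dim=-1)
        return torch.cat([q_self, q_pair], dim=-1)       # full Q-vector

q = ActionSemanticsNet(obs_dim=16, other_dim=8, n_self_actions=4, n_others=2)
print(q(torch.randn(5, 16), torch.randn(5, 2, 8)).shape)  # (5, 6)
```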

IJCAI Conference 2020 Conference Paper

Efficient Deep Reinforcement Learning via Adaptive Policy Transfer

  • Tianpei Yang
  • Jianye Hao
  • Zhaopeng Meng
  • Zongzhang Zhang
  • Yujing Hu
  • Yingfeng Chen
  • Changjie Fan
  • Weixun Wang

Transfer learning has shown great potential to accelerate reinforcement learning (RL) by leveraging prior knowledge from past policies learned on relevant tasks. Existing approaches either transfer previous knowledge by explicitly computing similarities between tasks or select appropriate source policies to provide guided exploration. However, a method that directly optimizes the target policy by selectively reusing knowledge from appropriate source policies, without explicitly measuring task similarity, has been missing. In this paper, we propose a novel Policy Transfer Framework (PTF) that takes advantage of this idea. PTF learns when and which source policy is best to reuse for the target policy, and when to terminate it, by modeling multi-policy transfer as an option learning problem. PTF can be easily combined with existing DRL methods, and experimental results show that it significantly accelerates RL and surpasses state-of-the-art policy transfer methods in terms of learning efficiency and final performance in both discrete and continuous action spaces.
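
The option view can be made concrete with a small rollout sketch. Everything below (the environment interface, the option_value and terminate_prob callables) is an illustrative assumption, not the paper's API:

```python
# Illustrative sketch only: each source policy is an option with a learned
# termination probability, so the agent picks which source to follow and
# when to stop following it.
import random

def ptf_rollout(env, source_policies, option_value, terminate_prob, steps=1000):
    """option_value(state, k) scores reusing source k; terminate_prob(state, k)
    is the learned chance of ending the current option."""
    state, option = env.reset(), None
    for _ in range(steps):
        if option is None or random.random() < terminate_prob(state, option):
            option = max(range(len(source_policies)),
                         key=lambda k: option_value(state, k))
        # The target policy is optimized elsewhere; the chosen source only
        # guides exploration (e.g., via an imitation term in the loss).
        action = source_policies[option](state)
        state, reward, done, _ = env.step(action)
        if done:
            state, option = env.reset(), None
```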

JAAMAS Journal 2020 Journal Article

Efficient policy detecting and reusing for non-stationarity in Markov games

  • Yan Zheng
  • Jianye Hao
  • Changjie Fan

One challenging problem in multiagent systems is to cooperate or compete with non-stationary agents that change behavior from time to time. An agent in such a non-stationary environment is usually expected to quickly detect the other agents' policy during online interaction and then adapt its own policy accordingly. This article studies efficient policy detection and reuse techniques when playing against non-stationary agents in cooperative or competitive Markov games. We propose a new deep Bayesian policy reuse algorithm, a.k.a. DPN-BPR+, by extending the recent BPR+ algorithm with a neural network as the value-function approximator. To detect policies accurately, we propose the rectified belief model, which takes advantage of the opponent model to infer the other agents' policy from reward signals and its behavior. Instead of directly storing individual policies as BPR+ does, we introduce a distilled policy network that serves as the policy library and use policy distillation to achieve efficient online policy learning and reuse. DPN-BPR+ inherits all the advantages of BPR+. In experiments, we evaluate DPN-BPR+ in terms of detection accuracy, cumulative reward, and speed of convergence in four complex Markov games with raw visual inputs, including two cooperative games and two competitive games. Empirical results show that our proposed DPN-BPR+ approach outperforms existing algorithms in all these Markov games.
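
The detection step rests on a Bayesian belief update over candidate opponent policies. The numeric sketch below shows only the reward-signal part of that update (the rectified variant additionally conditions on observed opponent behavior); the Gaussian reward models are toy assumptions:

```python
# Minimal numeric sketch of a BPR-style Bayesian belief update from rewards.
import numpy as np
from scipy.stats import norm

def update_belief(belief, reward, reward_likelihood):
    """belief: prior over opponent policies; reward_likelihood[k](r) is the
    probability density of observing reward r against opponent policy k."""
    likelihoods = np.array([f(reward) for f in reward_likelihood])
    posterior = belief * likelihoods
    return posterior / posterior.sum()

# Toy example: two candidate opponent policies with Gaussian reward models.
models = [norm(loc=1.0, scale=0.5).pdf, norm(loc=-1.0, scale=0.5).pdf]
belief = np.array([0.5, 0.5])
belief = update_belief(belief, reward=0.8, reward_likelihood=models)
print(belief)  # mass shifts toward the first policy
```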

AAAI Conference 2020 Conference Paper

Fast and Robust Face-to-Parameter Translation for Game Character Auto-Creation

  • Tianyang Shi
  • Zhengxia Zuo
  • Yi Yuan
  • Changjie Fan

With the rapid development of role-playing games (RPGs), players are now allowed to edit the facial appearance of their in-game characters to their preferences rather than using default templates. This paper proposes a game character auto-creation framework that generates in-game characters according to a player's input face photo. Unlike previous methods designed around neural style transfer or monocular 3D face reconstruction, we re-formulate the character auto-creation process from a different point of view: predicting a large set of physically meaningful facial parameters under a self-supervised learning paradigm. Instead of updating facial parameters iteratively at the input end of the renderer as suggested by previous methods, which is time-consuming, we introduce a facial parameter translator so that creation can be done efficiently through a single forward propagation from the face embeddings to the parameters, with a considerable 1000x computational speedup. Despite its high efficiency, interactivity is preserved: users may optionally fine-tune the facial parameters of the created character according to their needs. Our approach also shows better robustness than previous methods, especially for photos with head-pose variation. Comparison results and ablation analysis on seven public face verification datasets demonstrate the effectiveness of our method.
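
A translator of this kind can be as simple as a small feed-forward network. The sketch below is a hypothetical stand-in (the embedding size, parameter count, and bounded-slider assumption are all illustrative):

```python
# Hypothetical sketch: face embedding -> facial parameters in one forward pass,
# replacing per-image iterative optimization at the renderer input.
import torch
import torch.nn as nn

translator = nn.Sequential(
    nn.Linear(512, 512), nn.ReLU(),    # 512-d face-recognition embedding assumed
    nn.Linear(512, 256), nn.ReLU(),
    nn.Linear(256, 200), nn.Sigmoid()  # e.g., 200 bounded slider parameters
)

embedding = torch.randn(1, 512)        # from a pretrained face recognizer
params = translator(embedding)         # single forward pass, no iteration
# Optional interactive fine-tuning: users may still nudge individual sliders.
params = params.detach().clone()
params[0, 42] = 0.9
```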

AAAI Conference 2020 Conference Paper

From Few to More: Large-Scale Dynamic Multiagent Curriculum Learning

  • Weixun Wang
  • Tianpei Yang
  • Yong Liu
  • Jianye Hao
  • Xiaotian Hao
  • Yujing Hu
  • Yingfeng Chen
  • Changjie Fan

Much effort has been devoted to investigating how agents can learn effectively and achieve coordination in multiagent systems. However, large-scale multiagent settings remain challenging due to the complex dynamics between the environment and the agents and the explosion of the state-action space. In this paper, we design a novel Dynamic Multiagent Curriculum Learning (DyMA-CL) approach that solves large-scale problems by starting from a small multiagent scenario and progressively increasing the number of agents. We propose three transfer mechanisms across curricula to accelerate the learning process. Moreover, because the state dimension varies across curricula, existing network structures cannot be applied in such a transfer setting since their input sizes are fixed. We therefore design a novel network structure called Dynamic Agent-number Network (DyAN) to handle the varying size of the network input. Experimental results show that DyMA-CL using DyAN greatly improves the performance of large-scale multiagent learning compared with state-of-the-art deep reinforcement learning approaches. We also investigate the influence of the three transfer mechanisms across curricula through extensive simulations.
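
The key trick in DyAN is a permutation-invariant aggregation over per-agent encodings, so the same weights accept any number of agents. A minimal sketch, with all sizes assumed:

```python
# Minimal sketch of the DyAN idea: encode each observed agent separately, then
# aggregate with a permutation-invariant sum so the same network accepts any
# number of agents across curricula.
import torch
import torch.nn as nn

class DynamicAgentNet(nn.Module):
    def __init__(self, self_dim=10, agent_dim=6, hidden=32, n_actions=5):
        super().__init__()
        self.agent_enc = nn.Sequential(nn.Linear(agent_dim, hidden), nn.ReLU())
        self.q_head = nn.Sequential(nn.Linear(self_dim + hidden, hidden),
                                    nn.ReLU(), nn.Linear(hidden, n_actions))

    def forward(self, self_obs, agent_obs):  # agent_obs: (batch, n_agents, agent_dim)
        pooled = self.agent_enc(agent_obs).sum(dim=1)  # invariant to agent count
        return self.q_head(torch.cat([self_obs, pooled], dim=-1))

net = DynamicAgentNet()
print(net(torch.randn(4, 10), torch.randn(4, 3, 6)).shape)   # 3 agents -> (4, 5)
print(net(torch.randn(4, 10), torch.randn(4, 11, 6)).shape)  # 11 agents, same net
```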

IJCAI Conference 2020 Conference Paper

Generating Behavior-Diverse Game AIs with Evolutionary Multi-Objective Deep Reinforcement Learning

  • Ruimin Shen
  • Yan Zheng
  • Jianye Hao
  • Zhaopeng Meng
  • Yingfeng Chen
  • Changjie Fan
  • Yang Liu

Generating diverse behaviors for game artificial intelligence (Game AI) has long been recognized as a challenging task in the game industry. Designing a Game AI with a satisfying behavioral characteristic (style) depends heavily on domain knowledge and is hard to achieve manually. Deep reinforcement learning sheds light on advancing automatic Game AI design. However, most existing approaches focus on creating a superhuman Game AI, ignoring the importance of behavioral diversity in games. To bridge the gap, we introduce a new framework, named EMOGI, which can automatically generate desirable styles with almost no domain knowledge. More importantly, EMOGI succeeds in creating a range of diverse styles, providing behavior-diverse Game AIs. Evaluations on Atari and real commercial games indicate that, compared to existing algorithms, EMOGI performs better at generating diverse behaviors and significantly improves the efficiency of Game AI design.

AAAI Conference 2020 Short Paper

Generative Adversarial Imitation Learning from Failed Experiences (Student Abstract)

  • Jiacheng Zhu
  • Jiahao Lin
  • Meng Wang
  • Yingfeng Chen
  • Changjie Fan
  • Chong Jiang
  • Zongzhang Zhang

Imitation learning provides a family of promising methods that learn policies directly from expert demonstrations. As a model-free, online imitation learning method, generative adversarial imitation learning (GAIL) generalizes well to unseen situations and can handle complex problems. In this paper, we propose a novel variant of GAIL called GAIL from failed experiences (GAILFE). GAILFE allows an agent to utilize failed experiences during training. Moreover, a constrained optimization objective is formalized in GAILFE to balance learning from the given demonstrations against learning from self-generated failed experiences. Empirically, compared with GAIL, GAILFE improves sample efficiency and learning speed across different tasks.
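
One plausible reading of the discriminator side is that failed experiences simply join the agent's rollouts as extra negatives, with a weight standing in for the constraint. The sketch below is that reading, not the paper's implementation; the loss weight lam and all shapes are assumptions:

```python
# Hedged sketch: failed experiences as weighted extra negatives for a GAIL-style
# discriminator over (state, action) pairs.
import torch
import torch.nn as nn

disc = nn.Sequential(nn.Linear(6, 64), nn.ReLU(), nn.Linear(64, 1))

def discriminator_loss(expert_sa, agent_sa, failed_sa, lam=0.5):
    """Expert pairs are positives; the agent's own rollouts and stored failed
    experiences are negatives, the latter weighted by lam."""
    bce = nn.BCEWithLogitsLoss()
    return (bce(disc(expert_sa), torch.ones(len(expert_sa), 1))
            + bce(disc(agent_sa), torch.zeros(len(agent_sa), 1))
            + lam * bce(disc(failed_sa), torch.zeros(len(failed_sa), 1)))

loss = discriminator_loss(torch.randn(32, 6), torch.randn(32, 6), torch.randn(32, 6))
loss.backward()
```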

NeurIPS Conference 2020 Conference Paper

Learning to Utilize Shaping Rewards: A New Approach of Reward Shaping

  • Yujing Hu
  • Weixun Wang
  • Hangtian Jia
  • Yixiang Wang
  • Yingfeng Chen
  • Jianye Hao
  • Feng Wu
  • Changjie Fan

Reward shaping is an effective technique for incorporating domain knowledge into reinforcement learning (RL). Existing approaches, such as potential-based reward shaping, normally make full use of a given shaping reward function. However, since the transformation of human knowledge into numeric reward values is often imperfect, for reasons such as human cognitive bias, fully utilizing the shaping reward function may fail to improve the performance of RL algorithms. In this paper, we consider the problem of adaptively utilizing a given shaping reward function. We formulate the utilization of shaping rewards as a bi-level optimization problem, where the lower level optimizes the policy using the shaping rewards and the upper level optimizes a parameterized shaping weight function for true reward maximization. We formally derive the gradient of the expected true reward with respect to the shaping weight function parameters and accordingly propose three learning algorithms based on different assumptions. Experiments in sparse-reward cartpole and MuJoCo environments show that our algorithms can fully exploit beneficial shaping rewards while ignoring unbeneficial shaping rewards or even transforming them into beneficial ones.
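
The lower level of the bi-level formulation reduces to training on an adaptively weighted reward. In the sketch below, z_phi and the shaping-reward values are illustrative names and stand-ins, not the paper's API:

```python
# Sketch of the lower-level reward under stated assumptions: the policy trains
# on the environment reward plus an adaptively weighted shaping term.
import torch
import torch.nn as nn

z_phi = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 1))  # weight net

def shaped_reward(state, env_reward, shaping_reward):
    """Lower level: r_total = r_env + z_phi(s) * f(s); the upper level adjusts
    phi by the gradient of the *true* expected return, so harmful shaping can
    be weighted toward zero or even negated."""
    return env_reward + z_phi(state).squeeze(-1) * shaping_reward

s = torch.randn(8, 4)  # e.g., a batch of cartpole states
print(shaped_reward(s, torch.ones(8), torch.full((8,), 0.3)).shape)  # (8,)
```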

ICML Conference 2020 Conference Paper

Q-value Path Decomposition for Deep Multiagent Reinforcement Learning

  • Yaodong Yang 0002
  • Jianye Hao
  • Guangyong Chen
  • Hongyao Tang
  • Yingfeng Chen
  • Yujing Hu
  • Changjie Fan
  • Zhongyu Wei

Recently, deep multiagent reinforcement learning (MARL) has become a highly active research area, as many real-world problems can inherently be viewed as multiagent systems. A particularly interesting and widely applicable class of problems is the partially observable cooperative multiagent setting, in which a team of agents learns to coordinate their behaviors conditioned on their private observations and a commonly shared global reward signal. One natural solution is the centralized-training-with-decentralized-execution paradigm; during centralized training, a key challenge is multiagent credit assignment: how to allocate the global rewards to individual agent policies for better coordination toward maximizing system-level benefits. In this paper, we propose a new method called Q-value Path Decomposition (QPD) to decompose the system's global Q-values into individual agents' Q-values. Unlike previous works, which restrict the representational relation between the individual Q-values and the global one, we bring the integrated gradients attribution technique into deep MARL to directly decompose global Q-values along trajectory paths and assign credits to agents. We evaluate QPD on the challenging StarCraft II micromanagement tasks and show that it achieves state-of-the-art performance in both homogeneous and heterogeneous multiagent scenarios compared with existing cooperative MARL algorithms.
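
Integrated gradients itself is generic and easy to sketch: attribute a scalar output to input features by integrating gradients along a straight path from a baseline. Applied per agent-feature block, the attribution mass yields a credit split; the toy Q-function below is an assumption, not the paper's model:

```python
# Generic integrated-gradients sketch (not the authors' implementation).
import torch

def integrated_gradients(q_fn, x, baseline=None, steps=64):
    baseline = torch.zeros_like(x) if baseline is None else baseline
    total = torch.zeros_like(x)
    for k in range(1, steps + 1):
        point = (baseline + (k / steps) * (x - baseline)).requires_grad_(True)
        q_fn(point).sum().backward()
        total += point.grad
    return (x - baseline) * total / steps  # completeness: sums to ~ Q(x) - Q(base)

# Toy global Q over concatenated per-agent features; per-agent credit is the
# attribution mass on that agent's slice.
q_fn = lambda z: (z ** 2).sum(dim=-1)
x = torch.randn(1, 6)               # two agents, three features each (assumed)
attr = integrated_gradients(q_fn, x)
print(attr.view(2, 3).sum(dim=-1))  # credit per agent
```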

IJCAI Conference 2019 Conference Paper

Deep Multi-Agent Reinforcement Learning with Discrete-Continuous Hybrid Action Spaces

  • Haotian Fu
  • Hongyao Tang
  • Jianye Hao
  • Zihan Lei
  • Yingfeng Chen
  • Changjie Fan

Deep Reinforcement Learning (DRL) has been applied to a variety of cooperative multi-agent problems with either discrete action spaces or continuous action spaces. However, to the best of our knowledge, no previous work has succeeded in applying DRL to multi-agent problems with discrete-continuous hybrid (or parameterized) action spaces, which are very common in practice. Our work fills this gap by proposing two novel algorithms: Deep Multi-Agent Parameterized Q-Networks (Deep MAPQN) and Deep Multi-Agent Hierarchical Hybrid Q-Networks (Deep MAHHQN). We follow the centralized-training-but-decentralized-execution paradigm: different levels of communication between agents are used to facilitate training, while each agent executes its policy independently based on local observations. Our empirical results on several challenging tasks (simulated RoboCup Soccer and the game Ghost Story) show that both Deep MAPQN and Deep MAHHQN are effective and significantly outperform the existing independent deep parameterized Q-learning method.
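
For intuition, the single-agent skeleton of a hybrid-action Q-network (a P-DQN-style layout; the paper's multi-agent coordination machinery is omitted) looks roughly like this, with all sizes assumed:

```python
# Minimal single-agent sketch of a hybrid-action Q-network: a parameter network
# proposes a continuous parameter for every discrete action, and a Q-network
# scores each (discrete action, its parameter) pair.
import torch
import torch.nn as nn

class HybridQ(nn.Module):
    def __init__(self, state_dim=8, n_discrete=3, param_dim=2, hidden=64):
        super().__init__()
        self.param_net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                       nn.Linear(hidden, n_discrete * param_dim),
                                       nn.Tanh())
        self.q_net = nn.Sequential(
            nn.Linear(state_dim + n_discrete * param_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_discrete))
        self.shape = (n_discrete, param_dim)

    def forward(self, s):
        params = self.param_net(s)                      # all actions' parameters
        q = self.q_net(torch.cat([s, params], dim=-1))  # one Q per discrete action
        return q, params.view(-1, *self.shape)

net = HybridQ()
q, params = net(torch.randn(5, 8))
a = q.argmax(dim=-1)                 # chosen discrete action
chosen = params[torch.arange(5), a]  # its continuous parameter
print(a.shape, chosen.shape)         # (5,) (5, 2)
```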

IJCAI Conference 2019 Conference Paper

Explicitly Coordinated Policy Iteration

  • Yujing Hu
  • Yingfeng Chen
  • Changjie Fan
  • Jianye Hao

Coordination on an optimal policy between independent learners in fully cooperative stochastic games is difficult due to problems such as relative overgeneralization and miscoordination. Most state-of-the-art algorithms apply fusion heuristics to agents' optimistic and average rewards, by which coordination between agents can be achieved implicitly. However, such implicit coordination faces practical issues such as tedious parameter tuning in real-world applications. The lack of an explicit coordination mechanism may also lead to a low likelihood of coordination in problems with multiple optimal policies. Based on the necessary conditions of an optimal policy, we propose the explicitly coordinated policy iteration (EXCEL) algorithm, which always forces agents to coordinate by comparing the agents' separate optimistic and average value functions. We also propose three solutions for deep reinforcement learning extensions of EXCEL. Extensive experiments in matrix games (from 2-agent 2-action games to 5-agent 20-action games) and stochastic games (from 2-agent to 5-agent games) show that EXCEL outperforms state-of-the-art algorithms, with faster convergence and better coordination.
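
A toy version of the explicit coordination test might look as follows; the certification rule and tolerance are illustrative assumptions, not the paper's formulation:

```python
# Toy sketch: keep both an optimistic value (best payoff seen per action) and an
# average value, and only commit to actions where the two roughly agree.
opt = {"a": 10.0, "b": 7.0}  # optimistic values per action
avg = {"a": 2.0, "b": 6.5}   # average values (miscoordination drags "a" down)

def coordinated_action(opt, avg, tol=1.0):
    """Prefer actions whose average return certifies the optimistic estimate;
    this guards against relative overgeneralization toward fragile optima."""
    certified = [a for a in opt if opt[a] - avg[a] <= tol]
    pool = certified or list(opt)
    return max(pool, key=lambda a: opt[a])

print(coordinated_action(opt, avg))  # "b": the optimistic "a" is not certified
```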

IJCAI Conference 2019 Conference Paper

Reinforcement Learning Experience Reuse with Policy Residual Representation

  • WenJi Zhou
  • Yang Yu
  • Yingfeng Chen
  • Kai Guan
  • Tangjie Lv
  • Changjie Fan
  • Zhi-Hua Zhou

Experience reuse is key to sample-efficient reinforcement learning. One of the critical issues is how the experience is represented and stored. Previously, experience has been stored in the form of features, individual models, or an average model, each lying at a different granularity. However, new tasks may require experience across multiple granularities. In this paper, we propose the policy residual representation (PRR) network, which can extract and store multiple levels of experience. The PRR network is trained on a set of tasks with a multi-level architecture, where a module at each level corresponds to a subset of the tasks. Therefore, the PRR network represents the experience in a spectrum-like way. When training on a new task, PRR can provide different levels of experience to accelerate learning. We experiment with the PRR network on a set of grid-world navigation tasks, locomotion tasks, and fighting tasks in a video game. The results show that the PRR network leads to better reuse of experience and thus outperforms several state-of-the-art approaches.
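
The spectrum structure can be pictured as a sum of residual modules of increasing specificity. A minimal sketch, with the linear modules and the depth argument as assumptions:

```python
# Sketch of a spectrum-like residual policy: logits are the sum of level
# outputs, from a module shared by all tasks down to task-specific ones.
import torch
import torch.nn as nn

class PRRSketch(nn.Module):
    def __init__(self, obs_dim=8, n_actions=4, levels=3):
        super().__init__()
        self.levels = nn.ModuleList(nn.Linear(obs_dim, n_actions)
                                    for _ in range(levels))

    def forward(self, obs, depth=None):
        depth = len(self.levels) if depth is None else depth
        return sum(m(obs) for m in self.levels[:depth])

net = PRRSketch()
coarse = net(torch.randn(2, 8), depth=1)  # reuse only the most general experience
fine = net(torch.randn(2, 8))             # full task-specific policy
```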

IJCAI Conference 2019 Conference Paper

Value Function Transfer for Deep Multi-Agent Reinforcement Learning Based on N-Step Returns

  • Yong Liu
  • Yujing Hu
  • Yang Gao
  • Yingfeng Chen
  • Changjie Fan

Many real-world problems, such as robot control and soccer games, are naturally modeled as sparse-interaction multi-agent systems. Reusing single-agent knowledge in multi-agent systems with sparse interactions can greatly accelerate the multi-agent learning process. Previous works rely on the bisimulation metric to define Markov decision process (MDP) similarity for controlling knowledge transfer. However, the bisimulation metric is costly to compute and is not suitable for high-dimensional state spaces. In this work, we propose more scalable transfer learning methods based on a novel MDP similarity concept. We start by defining MDP similarity based on the N-step return (NSR) values of an MDP. We then propose two knowledge transfer methods based on deep neural networks, called direct value function transfer and NSR-based value function transfer. We conduct experiments in an image-based grid world, the multi-agent particle environment (MPE), and the Ms. Pac-Man game. The results indicate that the proposed methods can significantly accelerate multi-agent reinforcement learning while achieving better asymptotic performance.
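
NSR-based similarity is cheap to prototype: compute N-step returns from sampled rewards and compare them across tasks. The particular distance below (mean absolute NSR gap over matched steps) is an illustrative assumption:

```python
# Sketch: compare two tasks by the discrepancy of their N-step returns.
import numpy as np

def n_step_returns(rewards, gamma=0.99, n=5):
    """NSR at every step t of one episode: sum_{k<n} gamma^k * r_{t+k}."""
    discounts = gamma ** np.arange(n)
    return np.array([rewards[t:t + n] @ discounts[:len(rewards) - t]
                     for t in range(len(rewards))])

def mdp_distance(rewards_a, rewards_b, **kw):
    """Cheap proxy for MDP similarity: mean absolute NSR gap; unlike a
    bisimulation metric, it needs only sampled rewards."""
    ra, rb = n_step_returns(rewards_a, **kw), n_step_returns(rewards_b, **kw)
    m = min(len(ra), len(rb))
    return np.abs(ra[:m] - rb[:m]).mean()

print(mdp_distance(np.ones(20), np.full(20, 0.8)))  # small gap -> similar tasks
```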

NeurIPS Conference 2018 Conference Paper

A Deep Bayesian Policy Reuse Approach Against Non-Stationary Agents

  • Yan Zheng
  • Zhaopeng Meng
  • Jianye Hao
  • Zongzhang Zhang
  • Tianpei Yang
  • Changjie Fan

In multiagent domains, coping with non-stationary agents that change behaviors from time to time is a challenging problem: an agent is usually required to quickly detect the other agent's policy during online interaction and then adapt its own policy accordingly. This paper studies efficient policy detection and reuse techniques when playing against non-stationary agents in Markov games. We propose a new deep BPR+ algorithm by extending the recent BPR+ algorithm with a neural network as the value-function approximator. To detect policies accurately, we propose the rectified belief model, which takes advantage of the opponent model to infer the other agent's policy from reward signals and its behaviors. Instead of directly storing individual policies as BPR+ does, we introduce a distilled policy network that serves as the policy library in BPR+, using policy distillation to achieve efficient online policy learning and reuse. Deep BPR+ inherits all the advantages of BPR+ and empirically shows better performance in terms of detection accuracy, cumulative rewards, and speed of convergence compared to existing algorithms in complex Markov games with raw visual inputs.

IJCAI Conference 2018 Conference Paper

Recurrent Deep Multiagent Q-Learning for Autonomous Brokers in Smart Grid

  • Yaodong Yang
  • Jianye Hao
  • Mingyang Sun
  • Zan Wang
  • Changjie Fan
  • Goran Strbac

The broker mechanism is widely applied to serve interested parties in deriving long-term policies that reduce costs or increase profits in the smart grid. However, a broker faces a number of challenging problems, such as balancing demand and supply from customers and competing with other coexisting brokers to maximize its profit. In this paper, we develop an effective pricing strategy for brokers in the local electricity retail market based on recurrent deep multiagent reinforcement learning and sequential clustering. We use real household electricity consumption data to simulate the retail market for evaluating our strategy. The experiments demonstrate the superior performance of the proposed pricing strategy and highlight the effectiveness of our reward shaping mechanism.