Arrow Research search

Author name cluster

Changjie Fan

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

49 papers
2 author rows

Possible papers

49

AAAI Conference 2025 Conference Paper

DialogDraw: Image Generation and Editing System Based on Multi-Turn Dialogue

  • Shichao Ma
  • Xinfeng Zhang
  • Zeng Zhao
  • Bai Liu
  • Changjie Fan
  • Zhipeng Hu

In recent years, diffusion modeling has shown great potential for image generation and editing. Beyond single-model approaches, various drawing workflows now exist to handle diverse drawing tasks. However, few solutions effectively identify user intentions through dialogue and progressively complete drawings. We introduce DialogDraw, which facilitates image generation and editing through continuous dialogue interaction. DialogDraw enables users to create and refine drawings using natural language and integrates with numerous open-source drawing workflows and models. The system accurately recognizes intentions and extracts user inputs via parameterization, adapts to various drawing function parameters, and provides an intuitive interaction mode. It effectively executes user instructions, supports dozens of image generation and editing methods, and offers robust scalability. Moreover, we employ SFT and RLHF to iterate the Intention Recognition and Parameter Extraction Model (IRPEM). To evaluate DialogDraw's functionality, we propose DrawnConvos, a dataset rich in drawing functions and command dialogue data collected from the open-source community. Our evaluation demonstrates that DialogDraw excels in command compliance, identifying and adapting to user drawing intentions, thereby proving the effectiveness of our method.

IROS Conference 2025 Conference Paper

High-Precision and High-Efficiency Trajectory Tracking for Excavators Based on Closed-Loop Dynamics

  • Ziqing Zou
  • Cong Wang
  • Yue Hu
  • Xiao Liu
  • Bowen Xu
  • Rong Xiong
  • Changjie Fan
  • Yingfeng Chen

The complex nonlinear dynamics of hydraulic excavators, such as time delays and control coupling, pose significant challenges to achieving high-precision trajectory tracking. Traditional control methods often fall short in such applications due to their inability to effectively handle these nonlinearities, while commonly used learning-based methods require extensive interactions with the environment, leading to inefficiency. To address these issues, we introduce EfficientTrack, a trajectory tracking method that integrates model-based learning to manage nonlinear dynamics and leverages closed-loop dynamics to improve learning efficiency, ultimately minimizing tracking errors. We validate our method through comprehensive experiments both in simulation and on a real-world excavator. Comparative experiments in simulation demonstrate that our method outperforms existing learning-based approaches, achieving the highest tracking precision and smoothness with the fewest interactions. Real-world experiments further show that our method remains effective under load conditions and possesses the ability for continual learning, highlighting its practical applicability. For implementation details and source code, please refer to https://github.com/ZiqingZou/EfficientTrack.
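
As a sketch of the closed-loop idea above: rather than modeling the raw hydraulics, one can learn the dynamics of the inner-loop-controlled system (state plus commanded setpoint to next state) and choose commands by differentiating the tracking error through that model. This is a hypothetical minimal sketch, not the paper's implementation; the module names, dimensions, and gradient-based command search are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical sketch: learn the closed-loop dynamics (state + command -> next
# state under the inner-loop controller), then pick commands by differentiating
# the tracking error through the learned model. Dimensions are illustrative.
class ClosedLoopDynamics(nn.Module):
    def __init__(self, state_dim=6, cmd_dim=3, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + cmd_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state, cmd):
        # Residual prediction of the next state reached by the closed loop.
        return state + self.net(torch.cat([state, cmd], dim=-1))

def track_step(model, state, target, cmd_dim=3, iters=20, lr=0.1):
    """One tracking step: optimize the command so the predicted next state hits the target."""
    for p in model.parameters():          # freeze the model; only the command moves
        p.requires_grad_(False)
    cmd = torch.zeros(cmd_dim, requires_grad=True)
    opt = torch.optim.Adam([cmd], lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        loss = ((model(state, cmd) - target) ** 2).sum()
        loss.backward()
        opt.step()
    return cmd.detach()

model = ClosedLoopDynamics()
cmd = track_step(model, torch.zeros(6), torch.ones(6))
```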

AAAI Conference 2025 Conference Paper

LLM4GEN: Leveraging Semantic Representation of LLMs for Text-to-Image Generation

  • Mushui Liu
  • Yuhang Ma
  • Zhen Yang
  • Jun Dan
  • Yunlong Yu
  • Zeng Zhao
  • Zhipeng Hu
  • Bai Liu

Diffusion models have exhibited substantial success in text-to-image generation. However, they often encounter challenges when dealing with complex and dense prompts involving multiple objects, attribute binding, and long descriptions. In this paper, we propose a novel framework called LLM4GEN, which enhances the semantic understanding of text-to-image diffusion models by leveraging the representation of Large Language Models (LLMs). It can be seamlessly incorporated into various diffusion models as a plug-and-play component. A specially designed Cross-Adapter Module (CAM) integrates the original text features of text-to-image models with LLM features, thereby enhancing text-to-image generation. Additionally, to facilitate and correct entity-attribute relationships in text prompts, we develop an entity-guided regularization loss to further improve generation performance. We also introduce DensePrompts, which contains 7,000 dense prompts to provide a comprehensive evaluation for the text-to-image generation task. Experiments indicate that LLM4GEN significantly improves the semantic alignment of SD1.5 and SDXL, demonstrating increases of 9.69% and 12.90% in color on T2I-CompBench, respectively. Moreover, it surpasses existing models in terms of sample quality, image-text alignment, and human evaluation.
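
A minimal sketch of what a cross-attention adapter in the spirit of the Cross-Adapter Module could look like: the diffusion model's original text features attend over projected LLM features, with a zero-initialized gate so the adapter starts as a near-identity map and can be plugged into a pretrained pipeline. The class name, dimensions (CLIP-like 768, LLM-like 4096), and gating scheme are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

# Hypothetical cross-attention adapter: text features (query) attend over
# projected LLM features (key/value); a zero-init gate keeps the module a
# near-identity map at the start of training.
class CrossAdapter(nn.Module):
    def __init__(self, text_dim=768, llm_dim=4096, heads=8):
        super().__init__()
        self.proj = nn.Linear(llm_dim, text_dim)       # map LLM features into text space
        self.attn = nn.MultiheadAttention(text_dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))       # zero-init: no change at start

    def forward(self, text_feat, llm_feat):
        # text_feat: (B, L_t, text_dim); llm_feat: (B, L_l, llm_dim)
        kv = self.proj(llm_feat)
        fused, _ = self.attn(query=text_feat, key=kv, value=kv)
        return text_feat + torch.tanh(self.gate) * fused

adapter = CrossAdapter()
out = adapter(torch.randn(2, 77, 768), torch.randn(2, 32, 4096))
```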

ICLR Conference 2025 Conference Paper

Reinforcement Learning from Imperfect Corrective Actions and Proxy Rewards

  • Zhao-Hui Jiang
  • Xuening Feng
  • Paul Weng
  • Yifei Zhu
  • Yan Song
  • Tianze Zhou
  • Yujing Hu
  • Tangjie Lv

In practice, reinforcement learning (RL) agents are often trained with a possibly imperfect proxy reward function, which may lead to a human-agent alignment issue (i.e., the learned policy either converges to non-optimal performance with low cumulative rewards, or achieves high cumulative rewards but in an undesired manner). To tackle this issue, we consider a framework where a human labeler can provide additional feedback in the form of corrective actions, which express the labeler's action preferences, although this feedback may be imperfect as well. In this setting, to obtain a better-aligned policy guided by both learning signals, we propose a novel value-based deep RL algorithm called **I**terative learning from **Co**rrective actions and **Pro**xy rewards (ICoPro), which cycles through three phases: (1) Solicit sparse corrective actions from a human labeler on the agent's demonstrated trajectories; (2) Incorporate these corrective actions into the Q-function using a margin loss to enforce adherence to the labeler's preferences; (3) Train the agent with standard RL losses regularized with a margin loss to learn from proxy rewards and propagate the Q-values learned from human feedback. Moreover, another novel design in our approach is to integrate pseudo-labels from the target Q-network to reduce human labor and further stabilize training. We experimentally validate our proposition on a variety of tasks (Atari games and autonomous driving on highway). On the one hand, using proxy rewards with different levels of imperfection, our method can better align with humans and is more sample-efficient than baseline methods. On the other hand, facing corrective actions with different types of imperfection, our method can overcome the non-optimality of this feedback thanks to the guidance from proxy rewards.
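
Phase (2) uses a margin loss of the kind common in learning from demonstrations; a minimal sketch under that reading is below. The function name, margin value, and the exact per-action penalty are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# A minimal sketch of a large-margin loss on corrected transitions: push the
# Q-value of the labeler's corrective action above every other action by at
# least `margin`. Names and the margin value are illustrative.
def margin_loss(q_values, corrective_action, margin=0.8):
    """q_values: (B, num_actions); corrective_action: (B,) long tensor."""
    num_actions = q_values.size(1)
    # l(a, a_E) = margin for a != a_E, 0 for a == a_E
    penalty = margin * (1.0 - F.one_hot(corrective_action, num_actions).float())
    # max_a [Q(s, a) + l(a, a_E)] - Q(s, a_E), minimized toward 0
    q_corr = q_values.gather(1, corrective_action.unsqueeze(1)).squeeze(1)
    return ((q_values + penalty).max(dim=1).values - q_corr).mean()

loss = margin_loss(torch.randn(4, 6), torch.tensor([0, 2, 1, 5]))
```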

AAAI Conference 2025 Conference Paper

Storynizor: Consistent Story Generation via Inter-Frame Synchronized and Shuffled ID Injection

  • Yuhang Ma
  • Wenting Xu
  • Chaoyi Zhao
  • Keqiang Sun
  • Qinfeng Jin
  • Xiaoda Yang
  • Zeng Zhao
  • Changjie Fan

Recent advances in text-to-image diffusion models have spurred significant interest in continuous story image generation. In this paper, we introduce Storynizor, a model capable of generating coherent stories with strong inter-frame character consistency, effective foreground-background separation, and diverse pose variation. The core innovation of Storynizor lies in its key modules: ID-Synchronizer and ID-Injector. The ID-Synchronizer employs an auto-mask self-attention module and a mask perceptual loss across inter-frame images to improve the consistency of character generation, vividly representing their postures and backgrounds. The ID-Injector utilizes a Shuffling Reference Strategy (SRS) to integrate ID features into specific locations, enhancing ID-based consistent character generation. Additionally, to facilitate the training of Storynizor, we have curated a novel dataset called StoryDB comprising 100,000 images. This dataset contains single- and multiple-character sets in diverse environments, layouts, and gestures with detailed descriptions. Experimental results indicate that Storynizor demonstrates superior coherent story generation with high-fidelity character consistency, flexible postures, and vivid backgrounds compared to other character-specific methods.

AAMAS Conference 2024 Conference Paper

A Trajectory Perspective on the Role of Data Sampling Techniques in Offline Reinforcement Learning

  • Jinyi Liu
  • Yi Ma
  • Jianye Hao
  • Yujing Hu
  • Yan Zheng
  • Tangjie Lv
  • Changjie Fan

In recent years, offline reinforcement learning (RL) algorithms have gained considerable attention. However, the role of data sampling techniques in offline RL has been somewhat overlooked, despite their potential to enhance online RL performance. Recent research in offline RL indicates that applying sampling techniques directly to state-transitions does not consistently improve performance. Therefore, to better leverage limited offline trajectory data, we investigate the impact of data sampling processes on offline RL algorithms from a trajectory perspective. In this paper, we introduce a memory technique, (Prioritized) Trajectory Replay (TR/PTR), to facilitate trajectory data storage and sampling. Building on TR, we delve into the potential of trajectory backward sampling, a method that has already proven effective in online RL, in the offline RL domain. Furthermore, to improve the sampling efficiency, we examine the influence of prioritized sampling based on various trajectory priority metrics on offline training. Integrating with existing algorithms, our findings demonstrate that data sampling and updates based on vanilla TR can contribute to more stable training. Also, our proposed 13 trajectory priority metrics for PTR exhibit outstanding performance on their respective applicable types of dataset, with the best-case scenario resulting in performance improvements exceeding 25%. These performance gains are achieved at a slight extra cost during the data sampling process, highlighting the significant advantages of trajectory-based data sampling for offline RL.
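
A minimal sketch of what trajectory-level storage with prioritized, backward sampling could look like; the priority metric here (trajectory return, floored to stay positive) is just one illustrative stand-in for the paper's 13 metrics, and all names are hypothetical.

```python
import numpy as np

# Hypothetical sketch of (Prioritized) Trajectory Replay: store whole
# trajectories, sample one by priority, then walk it backward.
class PrioritizedTrajectoryReplay:
    def __init__(self):
        self.trajectories, self.priorities = [], []

    def add(self, trajectory):
        # trajectory: list of (state, action, reward, next_state, done)
        self.trajectories.append(trajectory)
        ret = sum(step[2] for step in trajectory)
        self.priorities.append(max(ret, 1e-3))  # crude floor keeps priorities positive

    def sample_backward(self, rng=np.random):
        p = np.asarray(self.priorities)
        idx = rng.choice(len(self.trajectories), p=p / p.sum())
        return list(reversed(self.trajectories[idx]))  # backward pass over the trajectory

buf = PrioritizedTrajectoryReplay()
buf.add([("s0", 0, 1.0, "s1", False), ("s1", 1, 0.5, "s2", True)])
steps = buf.sample_backward()
```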

ICLR Conference 2024 Conference Paper

AlignDiff: Aligning Diverse Human Preferences via Behavior-Customisable Diffusion Model

  • Zibin Dong
  • Yifu Yuan
  • Jianye Hao
  • Fei Ni 0001
  • Yao Mu 0001
  • Yan Zheng 0002
  • Yujing Hu
  • Tangjie Lv

Aligning agent behaviors with diverse human preferences remains a challenging problem in reinforcement learning (RL), owing to the inherent abstractness and mutability of human preferences. To address these issues, we propose AlignDiff, a novel framework that leverages RLHF to quantify human preferences, covering abstractness, and utilizes them to guide diffusion planning for zero-shot behavior customizing, covering mutability. AlignDiff can accurately match user-customized behaviors and efficiently switch from one to another. To build the framework, we first establish the multi-perspective human feedback datasets, which contain comparisons for the attributes of diverse behaviors, and then train an attribute strength model to predict quantified relative strengths. After relabeling behavioral datasets with relative strengths, we proceed to train an attribute-conditioned diffusion model, which serves as a planner with the attribute strength model as a director for preference alignment at the inference phase. We evaluate AlignDiff on various locomotion tasks and demonstrate its superior performance on preference matching, switching, and covering compared to other baselines. Its capability of completing unseen downstream tasks under human instructions also showcases the promising potential for human-AI collaboration. More visualization videos are released on https://aligndiff.github.io/.

ICML Conference 2024 Conference Paper

Bayesian Design Principles for Offline-to-Online Reinforcement Learning

  • Hao Hu 0006
  • Yiqin Yang
  • Jianing Ye
  • Chengjie Wu
  • Ziqing Mai
  • Yujing Hu
  • Tangjie Lv
  • Changjie Fan

Offline reinforcement learning (RL) is crucial for real-world applications where exploration can be costly or unsafe. However, offline learned policies are often suboptimal, and further online fine-tuning is required. In this paper, we tackle the fundamental dilemma of offline-to-online fine-tuning: if the agent remains pessimistic, it may fail to learn a better policy, while if it becomes optimistic directly, performance may suffer from a sudden drop. We show that Bayesian design principles are crucial in solving such a dilemma. Instead of adopting optimistic or pessimistic policies, the agent should act in a way that matches its belief in optimal policies. Such a probability-matching agent can avoid a sudden performance drop while still being guaranteed to find the optimal policy. Based on our theoretical findings, we introduce a novel algorithm that outperforms existing methods on various benchmarks, demonstrating the efficacy of our approach. Overall, the proposed approach provides a new perspective on offline-to-online RL that has the potential to enable more effective learning from offline data.
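
One generic way to "act according to one's belief in optimal policies" is Thompson-style probability matching over an ensemble that approximates the posterior over Q-functions: sample one member per episode and act greedily under it. The sketch below illustrates that generic idea only; it is not the paper's algorithm, and all names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical probability-matching sketch: an ensemble of Q-heads stands in
# for a posterior over Q-functions; one member is sampled per episode and the
# agent acts greedily under it.
class QEnsemble(nn.Module):
    def __init__(self, state_dim=4, num_actions=2, members=5, hidden=64):
        super().__init__()
        self.members = nn.ModuleList(
            nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                          nn.Linear(hidden, num_actions))
            for _ in range(members)
        )

    def sample_member(self):
        return self.members[torch.randint(len(self.members), (1,)).item()]

ensemble = QEnsemble()
q = ensemble.sample_member()           # resample at episode start
action = q(torch.randn(4)).argmax().item()
```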

AAAI Conference 2024 Conference Paper

EnMatch: Matchmaking for Better Player Engagement via Neural Combinatorial Optimization

  • Kai Wang
  • Haoyu Liu
  • Zhipeng Hu
  • Xiaochuan Feng
  • Minghao Zhao
  • Shiwei Zhao
  • Runze Wu
  • Xudong Shen

Matchmaking is a core task in e-sports and online games, as it contributes to player engagement and further influences the game's lifecycle. Previous methods focus on creating fair games at all times. They divide players into different tiers based on skill levels and only select players from the same tier for each game. Though this strategy can ensure fair matchmaking, it is not always good for player engagement. In this paper, we propose a novel Engagement-oriented Matchmaking (EnMatch) framework to ensure fair games and simultaneously enhance player engagement. Two main issues need to be addressed. First, it is unclear how to measure the impact of different team compositions and confrontations on player engagement during the game, considering the variety of player characteristics. Second, such a detailed consideration of every single player during matchmaking results in an NP-hard combinatorial optimization problem with non-linear objectives. In light of these challenges, we turn to real-world data analysis to reveal engagement-related factors. The resulting insights guide the development of engagement modeling, enabling the estimation of quantified engagement before a match is completed. To handle the combinatorial optimization problem, we formulate it within a reinforcement learning framework, in which a neural combinatorial optimization model is built and solved. The performance of EnMatch is finally demonstrated through comparison with other state-of-the-art methods based on several real-world datasets and online deployments in two games.

UAI Conference 2024 Conference Paper

Hybrid CtrlFormer: Learning Adaptive Search Space Partition for Hybrid Action Control via Transformer-based Monte Carlo Tree Search

  • Jiashun Liu
  • Xiaotian Hao
  • Jianye Hao
  • Yan Zheng 0002
  • Yujing Hu
  • Changjie Fan
  • Tangjie Lv
  • Zhipeng Hu

Hybrid action control tasks are common in the real world and require controlling some discrete and continuous actions simultaneously. To solve these tasks, existing Deep Reinforcement Learning (DRL) methods either directly build a separate policy for each type of action or simplify the hybrid action space into a discrete or continuous action control problem. However, these methods neglect the challenge of exploration resulting from the complexity of the hybrid action space, so more sample-efficient algorithms are needed. To this end, we propose a novel Hybrid Control Transformer (Hybrid CtrlFormer) to achieve better exploration and exploitation for hybrid action control problems. The core idea is: 1) we construct a hybrid action space tree with the discrete actions at the higher level and the continuous parameter space at the lower level. Each parameter space is split into multiple subregions. 2) To simplify the exploration space, a Transformer-based Monte-Carlo tree search method is designed to efficiently evaluate and partition the hybrid action space into good and bad subregions along the tree. Our method achieves state-of-the-art performance and sample efficiency in a variety of environments with discrete-continuous action spaces.
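
A structural sketch of the two-level hybrid action space tree described above: discrete actions at the top, each owning a continuous parameter range that can be split into subregions for further partitioning. The Transformer-based evaluation is omitted, and the class names and split rule are illustrative assumptions.

```python
from dataclasses import dataclass, field

# Illustrative structure only: discrete action ids at the top level, each with
# a continuous parameter range recursively split into subregions as the search
# judges them good or bad.
@dataclass
class ParamRegion:
    low: float
    high: float
    value: float = 0.0            # search-estimated quality of this subregion
    children: list = field(default_factory=list)

    def split(self, k=2):
        step = (self.high - self.low) / k
        self.children = [ParamRegion(self.low + i * step, self.low + (i + 1) * step)
                         for i in range(k)]

@dataclass
class HybridActionTree:
    regions: dict  # discrete action id -> root ParamRegion

tree = HybridActionTree({0: ParamRegion(-1.0, 1.0), 1: ParamRegion(0.0, 10.0)})
tree.regions[0].split(k=4)   # refine a promising discrete branch
```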

AAMAS Conference 2024 Conference Paper

Mastering Robot Control through Point-based Reinforcement Learning with Pre-training

  • Yihong Chen
  • Cong Wang
  • Tianpei Yang
  • Meng Wang
  • Yingfeng Chen
  • Jifei Zhou
  • Chaoyi Zhao
  • Xinfeng Zhang

Visual-based Reinforcement Learning (RL) has gained prominence in robotics decision-making due to its significant potential. However, the prevalent utilization of images in visual-based RL lacks explicit descriptions of object structures and spatial configurations in scenes, thereby limiting the overall efficiency and robustness of RL in robot control. Additionally, training an RL policy solely using visual observations from scratch is typically sample-inefficient, rendering it impractical for real-world application. To address these challenges, this paper proposes a novel method, called Pre-training on Point-based RL (P2RL), which takes the point cloud representations of scenes as states and preserves the intricate spatial details between objects. To further enhance efficiency, we leverage the pre-training method to bolster the perception ability of the network. Key factors in the pre-training process are systematically examined to optimize downstream RL training. Experimental results demonstrate the superior robustness and efficiency of P2RL compared to the state-of-the-art image-based RL method, especially in evaluations involving untrained scenes.

AAAI Conference 2024 Conference Paper

Optimistic Value Instructors for Cooperative Multi-Agent Reinforcement Learning

  • Chao Li
  • Yupeng Zhang
  • Jianqi Wang
  • Yujing Hu
  • Shaokang Dong
  • Wenbin Li
  • Tangjie Lv
  • Changjie Fan

In cooperative multi-agent reinforcement learning, decentralized agents hold the promise of overcoming the combinatorial explosion of the joint action space and enabling greater scalability. However, they are susceptible to a game-theoretic pathology called relative overgeneralization (RO) that shadows the optimal joint action. Although recent value-decomposition algorithms guide decentralized agents by learning a factored global action value function, the representational limitation and the inaccurate sampling of optimal joint actions during the learning process leave this problem unresolved. To address this limitation, this paper proposes a novel algorithm called Optimistic Value Instructors (OVI). The main idea behind OVI is to introduce multiple optimistic instructors into the value-decomposition paradigm, which are capable of suggesting potentially optimal joint actions and rectifying the factored global action value function to recover these optimal actions. Specifically, the instructors maintain optimistic value estimations of per-agent local actions and thus eliminate the negative effects caused by other agents' exploratory or sub-optimal non-cooperation, enabling accurate identification and suggestion of optimal joint actions. Based on the instructors' suggestions, the paper further presents two instructive constraints to rectify the factored global action value function to recover these optimal joint actions, thus overcoming the RO problem. Experimental evaluation of OVI on various cooperative multi-agent tasks demonstrates its superior performance against multiple baselines, highlighting its effectiveness.
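
The instructors' optimism can be illustrated with a hysteretic-style update: move toward the TD target at the full rate when the estimate rises and at a damped rate when it falls, so a teammate's exploratory non-cooperation does not drag down the value of a good local action. This is a generic stand-in, not the paper's exact rule; names and rates are assumptions.

```python
import numpy as np

# Generic optimistic (hysteretic-style) per-agent value update: positive TD
# errors are applied at full rate, negative ones at a damped rate.
def optimistic_update(q, state, action, target, lr=0.1, pessimism=0.01):
    delta = target - q[state, action]
    rate = lr if delta >= 0 else pessimism   # optimism: shrink the negative step
    q[state, action] += rate * delta

q = np.zeros((10, 4))
optimistic_update(q, state=3, action=1, target=1.0)
```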

IJCAI Conference 2024 Conference Paper

STAR: Spatio-Temporal State Compression for Multi-Agent Tasks with Rich Observations

  • Chao Li
  • Yujing Hu
  • Shangdong Yang
  • Tangjie Lv
  • Changjie Fan
  • Wenbin Li
  • Chongjie Zhang
  • Yang Gao

This paper focuses on the problem of learning compressed state representations for multi-agent tasks. Under the assumption of rich observations, we pinpoint that the state representations should be compressed both spatially and temporally to enable efficient prioritization of task-relevant features, which existing works typically fail to do. To overcome this limitation, we propose a novel method named Spatio-Temporal stAte compRession (STAR) that explicitly defines both spatial and temporal compression operations on the learned state representations to encode per-agent task-relevant features. Specifically, we first formalize this problem by introducing the Task Informed Partially Observable Stochastic Game (TI-POSG). Then, we identify the spatial representation compression in it as encoding the latent states from the joint observations of all agents, and achieve this by learning representations that approximate the latent states based on an information-theoretic principle. After that, we further extract the task-relevant features of each agent from these representations by aligning them based on their reward similarities, which is regarded as the temporal representation compression. Structurally, we implement these two compressions by learning a set of agent-specific decoding functions and incorporate them into a critic shared by agents for scalable learning. We evaluate our method by developing decentralized policies on 12 maps of the StarCraft Multi-Agent Challenge benchmark, and the superior performance demonstrates its effectiveness.

ICLR Conference 2024 Conference Paper

Stylized Offline Reinforcement Learning: Extracting Diverse High-Quality Behaviors from Heterogeneous Datasets

  • Yihuan Mao
  • Chengjie Wu
  • Xi Chen
  • Hao Hu 0006
  • Ji Jiang
  • Tianze Zhou
  • Tangjie Lv
  • Changjie Fan

Previous literature on policy diversity in reinforcement learning (RL) either focuses on the online setting or ignores the policy performance. In contrast, offline RL, which aims to learn high-quality policies from batched data, has yet to fully leverage the intrinsic diversity of the offline dataset. Addressing this dichotomy and aiming to balance quality and diversity poses a significant challenge to extant methodologies. This paper introduces a novel approach, termed Stylized Offline RL (SORL), which is designed to extract high-performing, stylistically diverse policies from a dataset characterized by distinct behavioral patterns. Drawing inspiration from the venerable Expectation-Maximization (EM) algorithm, SORL innovatively alternates between policy learning and trajectory clustering, a mechanism that promotes policy diversification. To further augment policy performance, we introduce advantage-weighted style learning into the SORL framework. Experimental evaluations across multiple environments demonstrate the significant superiority of SORL over previous methods in extracting high-quality policies with diverse behaviors. A case in point is that SORL successfully learns strong policies with markedly distinct playing patterns from a real-world human dataset of a popular basketball video game "Dunk City Dynasty."
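
The EM-style alternation can be sketched on toy data: each "style" is a Gaussian over actions, the E-step assigns every trajectory to the style under which its actions are most likely, and the M-step refits each style on its cluster (standing in for advantage-weighted policy learning). Everything below is an illustrative toy, not SORL's implementation.

```python
import numpy as np

# Toy EM alternation between trajectory clustering (E-step) and per-style
# policy fitting (M-step). Styles are 1-D Gaussians over actions.
rng = np.random.default_rng(0)
trajs = [rng.normal(mu, 0.3, size=20) for mu in [-1.0, 0.0, 1.0] for _ in range(10)]

K = 3
means, stds = rng.normal(size=K), np.ones(K)
for _ in range(10):
    # E-step: per-trajectory log-likelihood under each style, pick the best.
    ll = np.array([[-0.5 * np.sum(((t - means[k]) / stds[k]) ** 2)
                    - len(t) * np.log(stds[k]) for k in range(K)] for t in trajs])
    assign = ll.argmax(axis=1)
    # M-step: refit each style on its assigned trajectories.
    for k in range(K):
        cluster = np.concatenate([t for t, a in zip(trajs, assign) if a == k] or trajs)
        means[k], stds[k] = cluster.mean(), max(cluster.std(), 1e-2)
```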

IJCAI Conference 2024 Conference Paper

vMFER: Von Mises-Fisher Experience Resampling Based on Uncertainty of Gradient Directions for Policy Improvement

  • Yiwen Zhu
  • Jinyi Liu
  • Wenya Wei
  • Qianyi Fu
  • Yujing Hu
  • Zhou Fang
  • Bo An
  • Jianye Hao

Reinforcement Learning (RL) is a widely employed technique in decision-making problems, encompassing two fundamental operations -- policy evaluation and policy improvement. Enhancing learning efficiency remains a key challenge in RL, with many efforts focused on using ensemble critics to boost policy evaluation efficiency. However, when using multiple critics, the actor in the policy improvement process can obtain different gradients. Previous studies have combined these gradients without considering their disagreements. Therefore, optimizing the policy improvement process is crucial to enhance learning efficiency. This study focuses on investigating the impact of gradient disagreements caused by ensemble critics on policy improvement. We introduce the concept of uncertainty of gradient directions as a means to measure the disagreement among gradients utilized in the policy improvement process. Through measuring the disagreement among gradients, we find that transitions with lower uncertainty of gradient directions are more reliable in the policy improvement process. Building on this analysis, we propose a method called von Mises-Fisher Experience Resampling (vMFER), which optimizes the policy improvement process by resampling transitions and assigning higher confidence to transitions with lower uncertainty of gradient directions. Our experiments demonstrate that vMFER significantly outperforms the benchmark and is particularly well-suited for ensemble structures in RL.
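
The core quantity can be sketched directly: treat each critic's per-transition policy gradient as a direction on the unit sphere; the length of their mean direction (the resultant length, which governs the concentration of a fitted von Mises-Fisher distribution) is high when critics agree and low when they disagree, and resampling weights can follow from it. Function names and the weighting scheme are illustrative assumptions.

```python
import numpy as np

# Per-transition agreement among critics' policy gradients via the resultant
# length of their normalized directions; transitions with higher agreement
# receive higher resampling probability.
def gradient_agreement(grads):
    """grads: (num_critics, dim) per-transition policy gradients."""
    units = grads / (np.linalg.norm(grads, axis=1, keepdims=True) + 1e-8)
    return np.linalg.norm(units.mean(axis=0))   # resultant length R in [0, 1]

def resample_probs(per_transition_grads):
    r = np.array([gradient_agreement(g) for g in per_transition_grads])
    return r / r.sum()   # favor transitions whose gradient directions agree

probs = resample_probs([np.random.randn(5, 16) for _ in range(32)])
```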

AAMAS Conference 2024 Conference Paper

vMFER: von Mises-Fisher Experience Resampling Based on Uncertainty of Gradient Directions for Policy Improvement of Actor-Critic Algorithms

  • Yiwen Zhu
  • Jinyi Liu
  • Wenya Wei
  • Qianyi Fu
  • Yujing Hu
  • Zhou Fang
  • Bo An
  • Jianye Hao

Reinforcement Learning (RL) is a widely employed technique in decision-making problems, encompassing two fundamental operations – policy evaluation and policy improvement. Actor-critic algorithms dominate the field of RL, but there is a challenge in improving their learning efficiency. To address this, ensemble critics are often employed to enhance policy evaluation efficiency. However, when using multiple critics, the actor in the policy improvement process can obtain different gradients. Previous studies have combined these gradients without considering their disagreements. Therefore, optimizing the policy improvement process is crucial to enhance the learning efficiency of actor-critic algorithms. This study focuses on investigating the impact of gradient disagreements caused by ensemble critics on policy improvement. We introduce the concept of uncertainty of gradient directions as a means to measure the disagreement among gradients utilized in the policy improvement process. Through measuring the disagreement among gradients, we find that transitions with lower uncertainty of gradient directions are more reliable in the policy improvement process. Building on this analysis, we propose a method called von Mises-Fisher Experience Resampling (vMFER), which optimizes the policy improvement process by resampling transitions and assigning higher confidence to transitions with lower uncertainty of gradient directions. Our experiments on MuJoCo robotic control tasks and robotic arm tasks with sparse rewards demonstrate that vMFER significantly outperforms the benchmark and is particularly well-suited for ensemble structures in RL.

AAMAS Conference 2023 Conference Paper

Adaptive Value Decomposition with Greedy Marginal Contribution Computation for Cooperative Multi-Agent Reinforcement Learning

  • Shanqi Liu
  • Yujing Hu
  • Runze Wu
  • Dong Xing
  • Yu Xiong
  • Changjie Fan
  • Kun Kuang
  • Yong Liu

Real-world cooperation often requires intensive, simultaneous coordination among agents. This task has been extensively studied within the framework of cooperative multi-agent reinforcement learning (MARL), and value decomposition methods are among those cutting-edge solutions. However, traditional methods that learn the value function as a monotonic mixing of per-agent utilities cannot solve tasks with non-monotonic returns. This hinders their application in generic scenarios. Recent methods tackle this problem from the perspective of implicit credit assignment by learning value functions with complete expressiveness or using additional structures to improve cooperation. However, they are either difficult to learn due to large joint action spaces or insufficient to capture the complicated interactions among agents which are essential to solving tasks with non-monotonic returns. Moreover, applications in real-world scenarios usually require policies to be interpretable, but interpretability is limited in the implicit credit assignment methods. To address these problems, we propose a novel explicit credit assignment method for the non-monotonic problem. Our method, Adaptive Value decomposition with Greedy Marginal contribution (AVGM), is based on an adaptive value decomposition that learns the cooperative value of a group of dynamically changing agents. We first illustrate that the proposed value decomposition can consider the complicated interactions among agents and is feasible to learn in large-scale scenarios. Then, our method uses a greedy marginal contribution computed from the value decomposition as an individual credit to incentivize agents to learn the optimal cooperative policy. We further extend the module with an action encoder to guarantee the linear time complexity for computing the greedy marginal contribution. Experimental results demonstrate that our method achieves significant performance improvements in several non-monotonic domains. Besides, we showcase that our model maintains a good sense of interpretability and rationality, suggesting it can be applied to scenarios with more realistic demands.
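
A greedy marginal contribution can be sketched as follows: grow the group one agent at a time, always adding the agent whose inclusion raises the group value most, and take that increase as the agent's credit. The group_value callable below is a toy stand-in for the learned adaptive value decomposition.

```python
import numpy as np

# Greedy marginal-contribution credits: each agent's credit is the value gain
# its inclusion brings to the greedily grown group.
def greedy_marginal_credits(agents, group_value):
    group, credits = [], {}
    while len(group) < len(agents):
        remaining = [a for a in agents if a not in group]
        gains = [group_value(group + [a]) - group_value(group) for a in remaining]
        best = int(np.argmax(gains))
        credits[remaining[best]] = gains[best]
        group.append(remaining[best])
    return credits

# Toy value function: diminishing returns in group size.
credits = greedy_marginal_credits([0, 1, 2], lambda g: np.sqrt(len(g)))
```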

NeurIPS Conference 2023 Conference Paper

Conservative Offline Policy Adaptation in Multi-Agent Games

  • Chengjie Wu
  • Pingzhong Tang
  • Jun Yang
  • Yujing Hu
  • Tangjie Lv
  • Changjie Fan
  • Chongjie Zhang

Prior research on policy adaptation in multi-agent games has often relied on online interaction with the target agent in training, which can be expensive and impractical in real-world scenarios. Inspired by recent progress in offline reinforcement learning, this paper studies offline policy adaptation, which aims to utilize the target agent's behavior data to exploit its weakness or enable effective cooperation. We investigate its distinct challenges of distributional shift and risk-free deviation, and propose a novel learning objective, conservative offline adaptation, that optimizes the worst-case performance against any dataset-consistent proxy models. We propose an efficient algorithm called Constrained Self-Play (CSP) that incorporates dataset information into regularized policy learning. We prove that CSP learns a near-optimal risk-free offline adaptation policy upon convergence. Empirical results demonstrate that CSP outperforms non-conservative baselines in various environments, including Maze, predator-prey, MuJoCo, and Google Football.

AAAI Conference 2023 Conference Paper

DINet: Deformation Inpainting Network for Realistic Face Visually Dubbing on High Resolution Video

  • Zhimeng Zhang
  • Zhipeng Hu
  • Wenjin Deng
  • Changjie Fan
  • Tangjie Lv
  • Yu Ding

For few-shot learning, it is still a critical challenge to realize photo-realistic face visually dubbing on high-resolution videos. Previous works fail to generate high-fidelity dubbing results. To address the above problem, this paper proposes a Deformation Inpainting Network (DINet) for high-resolution face visually dubbing. Different from previous works relying on multiple up-sample layers to directly generate pixels from latent embeddings, DINet performs spatial deformation on feature maps of reference images to better preserve high-frequency textural details. Specifically, DINet consists of one deformation part and one inpainting part. In the first part, five reference facial images adaptively perform spatial deformation to create deformed feature maps encoding mouth shapes at each frame, in order to align with input driving audio and also the head poses of input source images. In the second part, to produce face visually dubbing, a feature decoder is responsible for adaptively incorporating mouth movements from the deformed feature maps and other attributes (i.e., head pose and upper facial expression) from the source feature maps together. Finally, DINet achieves face visually dubbing with rich textural details. We conduct qualitative and quantitative comparisons to validate our DINet on high-resolution videos. The experimental results show that our method outperforms state-of-the-art works.

ICLR Conference 2023 Conference Paper

EUCLID: Towards Efficient Unsupervised Reinforcement Learning with Multi-choice Dynamics Model

  • Yifu Yuan
  • Jianye Hao
  • Fei Ni 0001
  • Yao Mu 0001
  • Yan Zheng 0002
  • Yujing Hu
  • Jinyi Liu 0002
  • Yingfeng Chen

Unsupervised reinforcement learning (URL) poses a promising paradigm to learn useful behaviors in a task-agnostic environment without the guidance of extrinsic rewards, facilitating fast adaptation to various downstream tasks. Previous works focused on pre-training in a model-free manner while lacking the study of transition dynamics modeling, leaving large room for improving sample efficiency in downstream tasks. To this end, we propose an Efficient Unsupervised Reinforcement Learning Framework with a Multi-choice Dynamics model (EUCLID), which introduces a novel model-fused paradigm to jointly pre-train the dynamics model and the unsupervised exploration policy in the pre-training phase, thus better leveraging the environmental samples and improving downstream task sampling efficiency. However, constructing a generalizable model which captures the local dynamics under different behaviors remains a challenging problem. We introduce the multi-choice dynamics model that covers different local dynamics under different behaviors concurrently, which uses different heads to learn the state transitions under different behaviors during unsupervised pre-training and selects the most appropriate head for prediction in the downstream task. Experimental results in the manipulation and locomotion domains demonstrate that EUCLID achieves state-of-the-art performance with high sample efficiency, basically solving the state-based URLB benchmark and reaching a mean normalized score of 104.0±1.2% in downstream tasks with 100k fine-tuning steps, which is equivalent to DDPG's performance at 2M interactive steps with 20× more data. More visualization videos are released on our homepage.
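
A minimal sketch of a multi-choice dynamics model under the description above: a shared trunk with several prediction heads, and head selection by lowest prediction error on downstream data. Shapes, names, and the selection rule's details are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical multi-choice dynamics model: shared trunk, multiple heads, each
# specializing in the transition dynamics under a different behavior; the head
# with the lowest prediction error on downstream transitions is selected.
class MultiChoiceDynamics(nn.Module):
    def __init__(self, state_dim=8, act_dim=2, heads=4, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim + act_dim, hidden), nn.ReLU())
        self.heads = nn.ModuleList(nn.Linear(hidden, state_dim) for _ in range(heads))

    def forward(self, s, a, head):
        return self.heads[head](self.trunk(torch.cat([s, a], dim=-1)))

    @torch.no_grad()
    def select_head(self, s, a, s_next):
        errs = [((self(s, a, h) - s_next) ** 2).mean() for h in range(len(self.heads))]
        return int(torch.stack(errs).argmin())

model = MultiChoiceDynamics()
best = model.select_head(torch.randn(64, 8), torch.randn(64, 2), torch.randn(64, 8))
```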

AAAI Conference 2023 Conference Paper

FlowFace: Semantic Flow-Guided Shape-Aware Face Swapping

  • Hao Zeng
  • Wei Zhang
  • Changjie Fan
  • Tangjie Lv
  • Suzhen Wang
  • Zhimeng Zhang
  • Bowen Ma
  • Lincheng Li

In this work, we propose a semantic flow-guided two-stage framework for shape-aware face swapping, namely FlowFace. Unlike most previous methods that focus on transferring the source inner facial features but neglect facial contours, our FlowFace can transfer both of them to a target face, thus leading to more realistic face swapping. Concretely, our FlowFace consists of a face reshaping network and a face swapping network. The face reshaping network addresses the shape outline differences between the source and target faces. It first estimates a semantic flow (i.e. face shape differences) between the source and the target face, and then explicitly warps the target face shape with the estimated semantic flow. After reshaping, the face swapping network generates inner facial features that exhibit the identity of the source face. We employ a pre-trained face masked autoencoder (MAE) to extract facial features from both the source face and the target face. In contrast to previous methods that use identity embedding to preserve identity information, the features extracted by our encoder can better capture facial appearances and identity information. Then, we develop a cross-attention fusion module to adaptively fuse inner facial features from the source face with the target facial attributes, thus leading to better identity preservation. Extensive quantitative and qualitative experiments on in-the-wild faces demonstrate that our FlowFace outperforms the state-of-the-art significantly.

AAMAS Conference 2023 Conference Paper

Off-Beat Multi-Agent Reinforcement Learning

  • Wei Qiu
  • Weixun Wang
  • Rundong Wang
  • Bo An
  • Yujing Hu
  • Svetlana Obraztsova
  • Zinovi Rabinovich
  • Jianye Hao

We investigate cooperative multi-agent reinforcement learning in environments with off-beat actions, i.e., all actions have execution durations, and during these durations the environmental changes are not synchronised with action executions. To learn efficient multi-agent coordination in environments with off-beat actions, we propose a novel reward redistribution method built on our novel graph-based episodic memory, which we name LeGEM. Empirical results on the stag-hunter game show that it significantly boosts multi-agent coordination.

AAAI Conference 2023 Conference Paper

StyleTalk: One-Shot Talking Head Generation with Controllable Speaking Styles

  • Yifeng Ma
  • Suzhen Wang
  • Zhipeng Hu
  • Changjie Fan
  • Tangjie Lv
  • Yu Ding
  • Zhidong Deng
  • Xin Yu

Different people speak with diverse personalized speaking styles. Although existing one-shot talking head methods have made significant progress in lip sync, natural facial expressions, and stable head motions, they still cannot generate diverse speaking styles in the final talking head videos. To tackle this problem, we propose a one-shot style-controllable talking face generation framework. In a nutshell, we aim to attain a speaking style from an arbitrary reference speaking video and then drive the one-shot portrait to speak with the reference speaking style and another piece of audio. Specifically, we first develop a style encoder to extract dynamic facial motion patterns of a style reference video and then encode them into a style code. Afterward, we introduce a style-controllable decoder to synthesize stylized facial animations from the speech content and style code. In order to integrate the reference speaking style into generated videos, we design a style-aware adaptive transformer, which enables the encoded style code to adjust the weights of the feed-forward layers accordingly. Thanks to the style-aware adaptation mechanism, the reference speaking style can be better embedded into synthesized videos during decoding. Extensive experiments demonstrate that our method is capable of generating talking head videos with diverse speaking styles from only one portrait image and an audio clip while achieving authentic visual effects. Project Page: https://github.com/FuxiVirtualHuman/styletalk.

ICML Conference 2022 Conference Paper

Deconfounded Value Decomposition for Multi-Agent Reinforcement Learning

  • Jiahui Li 0003
  • Kun Kuang 0001
  • Baoxiang Wang 0001
  • Furui Liu
  • Long Chen 0016
  • Changjie Fan
  • Fei Wu 0001
  • Jun Xiao 0001

Value decomposition (VD) methods have been widely used in cooperative multi-agent reinforcement learning (MARL), where credit assignment plays an important role in guiding the agents’ decentralized execution. In this paper, we investigate VD from a novel perspective of causal inference. We first show that the environment in existing VD methods is an unobserved confounder as the common cause factor of the global state and the joint value function, which leads to the confounding bias on learning credit assignment. We then present our approach, deconfounded value decomposition (DVD), which cuts off the backdoor confounding path from the global state to the joint value function. The cut is implemented by introducing the trajectory graph, which depends only on the local trajectories, as a proxy confounder. DVD is general enough to be applied to various VD methods, and extensive experiments show that DVD can consistently achieve significant performance gains over different state-of-the-art VD methods on StarCraft II and MACO benchmarks.

ICML Conference 2022 Conference Paper

Individual Reward Assisted Multi-Agent Reinforcement Learning

  • Li Wang
  • Yupeng Zhang
  • Yujing Hu
  • Weixun Wang
  • Chongjie Zhang
  • Yang Gao 0001
  • Jianye Hao
  • Tangjie Lv

In many real-world multi-agent systems, the sparsity of team rewards often makes it difficult for an algorithm to successfully learn a cooperative team policy. At present, the common way for solving this problem is to design some dense individual rewards for the agents to guide the cooperation. However, most existing works utilize individual rewards in ways that do not always promote teamwork and sometimes are even counterproductive. In this paper, we propose Individual Reward Assisted Team Policy Learning (IRAT), which learns two policies for each agent from the dense individual reward and the sparse team reward with discrepancy constraints for updating the two policies mutually. Experimental results in different scenarios, such as the Multi-Agent Particle Environment and the Google Research Football Environment, show that IRAT significantly outperforms the baseline methods and can greatly promote team policy learning without deviating from the original team objective, even when the individual rewards are misleading or conflict with the team rewards.
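
The discrepancy constraint can be illustrated with a KL penalty between the individual-reward policy and the team-reward policy, so individual-reward shaping cannot pull the team policy off its objective. This is a hedged sketch of the general mechanism, not IRAT's exact objective; the distributions and coefficient are assumptions.

```python
import torch
import torch.distributions as D

# Generic discrepancy penalty between the two per-agent policies: each policy
# trains on its own reward plus a KL term that keeps the pair close.
def discrepancy_penalty(logits_individual, logits_team):
    p_ind = D.Categorical(logits=logits_individual)
    p_team = D.Categorical(logits=logits_team)
    return D.kl_divergence(p_ind, p_team).mean()

pen = discrepancy_penalty(torch.randn(32, 5), torch.randn(32, 5))
# total_loss = team_pg_loss + beta * pen   (beta: a tunable coefficient)
```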

AAAI Conference 2022 Conference Paper

Multi-Dimensional Prediction of Guild Health in Online Games: A Stability-Aware Multi-Task Learning Approach

  • Chuang Zhao
  • Hongke Zhao
  • Runze Wu
  • Qilin Deng
  • Yu Ding
  • Jianrong Tao
  • Changjie Fan

Guilds are the most important long-term virtual communities and emotional bonds in massively multiplayer online role-playing games (MMORPGs). How guilds fare, e.g., whether they stay healthy or not, matters greatly to player retention and the game ecology. The main challenge is to characterize and predict guild health in a quantitative, dynamic, and multi-dimensional manner based on complicated multimedia data streams. To this end, we propose a novel framework, namely the Stability-Aware Multi-task Learning Approach (SAMLA), to address these challenges. Specifically, different media-specific modules are designed to extract information from multiple media types of tabular data, time series characteristics, and heterogeneous graphs. To capture the dynamics of guild health, we introduce a representation encoder to provide a time-series view of multi-media data that is used for task prediction. Inspired by well-received theories on organization management, we delicately define five specific and quantitative dimensions of guild health and make parallel predictions based on a multi-task approach. Besides, we devise a novel auxiliary task, i.e., guild stability, to boost the performance of the guild health prediction task. Extensive experiments on a real-world large-scale MMORPG dataset verify that our proposed method outperforms the state-of-the-art methods in the task of organizational health characterization and prediction. Moreover, our work has been practically deployed in an online MMORPG, and case studies clearly illustrate its significant value.

NeurIPS Conference 2021 Conference Paper

An Efficient Transfer Learning Framework for Multiagent Reinforcement Learning

  • Tianpei Yang
  • Weixun Wang
  • Hongyao Tang
  • Jianye Hao
  • Zhaopeng Meng
  • Hangyu Mao
  • Dong Li
  • Wulong Liu

Transfer Learning has shown great potential to enhance single-agent Reinforcement Learning (RL) efficiency. Similarly, Multiagent RL (MARL) can also be accelerated if agents can share knowledge with each other. However, it remains an open problem how an agent should learn from other agents. In this paper, we propose a novel Multiagent Policy Transfer Framework (MAPTF) to improve MARL efficiency. MAPTF learns which agent's policy is the best to reuse for each agent and when to terminate it by modeling multiagent policy transfer as the option learning problem. Furthermore, in practice, the option module can only collect all agents' local experiences for updates due to the partial observability of the environment. In this setting, the agents' experiences may be inconsistent with one another, which may cause inaccuracy and oscillation in the option-value estimation. Therefore, we propose a novel option learning algorithm, successor representation option learning, to solve this issue by decoupling the environment dynamics from rewards and learning the option-value under each agent's preference. MAPTF can be easily combined with existing deep RL and MARL approaches, and experimental results show it significantly boosts the performance of existing methods in both discrete and continuous state spaces.

IJCAI Conference 2021 Conference Paper

Audio2Head: Audio-driven One-shot Talking-head Generation with Natural Head Motion

  • Suzhen Wang
  • Lincheng Li
  • Yu Ding
  • Changjie Fan
  • Xin Yu

We propose an audio-driven talking-head method to generate photo-realistic talking-head videos from a single reference image. In this work, we tackle two key challenges: (i) producing natural head motions that match speech prosody, and (ii) maintaining the appearance of a speaker in a large head motion while stabilizing the non-face regions. We first design a head pose predictor by modeling rigid 6D head movements with a motion-aware recurrent neural network (RNN). In this way, the predicted head poses act as the low-frequency holistic movements of a talking head, thus allowing our latter network to focus on detailed facial movement generation. To depict the entire image motions arising from audio, we exploit a keypoint based dense motion field representation. Then, we develop a motion field generator to produce the dense motion fields from input audio, head poses, and a reference image. As this keypoint based representation models the motions of facial regions, head, and backgrounds integrally, our method can better constrain the spatial and temporal consistency of the generated videos. Finally, an image generation network is employed to render photo-realistic talking-head videos from the estimated keypoint based motion fields and the input reference image. Extensive experiments demonstrate that our method produces videos with plausible head motions, synchronized facial expressions, and stable backgrounds and outperforms the state-of-the-art.

IJCAI Conference 2021 Conference Paper

Automatic Translation of Music-to-Dance for In-Game Characters

  • Yinglin Duan
  • Tianyang Shi
  • Zhipeng Hu
  • Zhengxia Zou
  • Changjie Fan
  • Yi Yuan
  • Xi Li

Music-to-dance translation is an emerging and powerful feature in recent role-playing games. Previous works on this topic consider music-to-dance as a supervised motion generation problem based on time-series data. However, these methods require a large number of training data pairs and may suffer from the degradation of movements. This paper provides a new solution to this task where we re-formulate the translation as a piece-wise dance phrase retrieval problem based on choreography theory. With such a design, players are allowed to optionally edit the dance movements on top of our generation, whereas other regression-based methods ignore such user interactivity. Considering that dance motion capture is expensive and requires the assistance of professional dancers, we train our method in a semi-supervised learning fashion with a large unlabeled music dataset (20× the size of our labeled one) and also introduce self-supervised pre-training to improve the training stability and generalization performance. Experimental results suggest that our method not only generalizes well over various styles of music but also succeeds in choreography for game players. Our project, including the large-scale dataset and supplemental materials, is available at https://github.com/FuxiCV/music-to-dance.

NeurIPS Conference 2021 Conference Paper

Episodic Multi-agent Reinforcement Learning with Curiosity-driven Exploration

  • Lulu Zheng
  • Jiarui Chen
  • Jianhao Wang
  • Jiamin He
  • Yujing Hu
  • Yingfeng Chen
  • Changjie Fan
  • Yang Gao

Efficient exploration in deep cooperative multi-agent reinforcement learning (MARL) still remains challenging in complex coordination problems. In this paper, we introduce a novel method, Episodic Multi-agent reinforcement learning with Curiosity-driven exploration (EMC). We leverage an insight of popular factorized MARL algorithms that the "induced" individual Q-values, i.e., the individual utility functions used for local execution, are the embeddings of local action-observation histories, and can capture the interaction between agents due to reward backpropagation during centralized training. Therefore, we use prediction errors of individual Q-values as intrinsic rewards for coordinated exploration and utilize episodic memory to exploit explored informative experience to boost policy training. As the dynamics of an agent's individual Q-value function captures the novelty of states and the influence from other agents, our intrinsic reward can induce coordinated exploration toward new or promising states. We illustrate the advantages of our method with didactic examples, and demonstrate that it significantly outperforms state-of-the-art MARL baselines on challenging tasks in the StarCraft II micromanagement benchmark.
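
The curiosity signal can be sketched as follows: a predictor regresses each agent's individual Q-values from its observation, and the prediction error serves as the intrinsic reward, being large in novel or influence-heavy situations. Shapes and names are illustrative; the episodic memory component is omitted.

```python
import torch
import torch.nn as nn

# Hypothetical curiosity sketch: prediction error of an agent's individual
# Q-values acts as an intrinsic exploration bonus.
class QPredictor(nn.Module):
    def __init__(self, obs_dim=16, num_actions=5, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, num_actions))

    def intrinsic_reward(self, obs, individual_q):
        # Larger error -> more novel -> larger exploration bonus.
        return ((self.net(obs) - individual_q.detach()) ** 2).mean(dim=-1)

pred = QPredictor()
bonus = pred.intrinsic_reward(torch.randn(32, 16), torch.randn(32, 5))
```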

ICML Conference 2021 Conference Paper

MetaCURE: Meta Reinforcement Learning with Empowerment-Driven Exploration

  • Jin Zhang 0016
  • Jianhao Wang
  • Hao Hu 0006
  • Tong Chen
  • Yingfeng Chen
  • Changjie Fan
  • Chongjie Zhang

Meta reinforcement learning (meta-RL) extracts knowledge from previous tasks and achieves fast adaptation to new tasks. Despite recent progress, efficient exploration in meta-RL remains a key challenge in sparse-reward tasks, as it requires quickly finding informative task-relevant experiences in both meta-training and adaptation. To address this challenge, we explicitly model an exploration policy learning problem for meta-RL, which is separated from exploitation policy learning, and introduce a novel empowerment-driven exploration objective, which aims to maximize information gain for task identification. We derive a corresponding intrinsic reward and develop a new off-policy meta-RL framework, which efficiently learns separate context-aware exploration and exploitation policies by sharing the knowledge of task inference. Experimental evaluation shows that our meta-RL method significantly outperforms state-of-the-art baselines on various sparse-reward MuJoCo locomotion tasks and more complex sparse-reward Meta-World tasks.

AAAI Conference 2021 Conference Paper

Reinforcement Learning with a Disentangled Universal Value Function for Item Recommendation

  • Kai Wang
  • Zhene Zou
  • Qilin Deng
  • Jianrong Tao
  • Runze Wu
  • Changjie Fan
  • Liang Chen
  • Peng Cui

In recent years, there has been great interest, as well as great challenges, in applying reinforcement learning (RL) to recommendation systems (RS). In this paper, we summarize three key practical challenges of large-scale RL-based recommender systems: massive state and action spaces, high-variance environments, and the unspecific reward setting in recommendation. All these problems remain largely unexplored in the existing literature and make the application of RL challenging. We develop a model-based reinforcement learning framework, called GoalRec. Inspired by the ideas of the world model (model-based), value function estimation (model-free), and goal-based RL, a novel disentangled universal value function designed for item recommendation is proposed. It can generalize to various goals that the recommender may have, and disentangle the stochastic environmental dynamics and high-variance reward signals accordingly. As a part of the value function, free from the sparse and high-variance reward signals, a high-capacity reward-independent world model is trained to simulate complex environmental dynamics under a certain goal. Based on the predicted environmental dynamics, the disentangled universal value function is related to the user's future trajectory instead of a monolithic state and a scalar reward. We demonstrate the superiority of GoalRec over previous approaches in terms of the above three practical challenges in a series of simulations and a real application.

NeurIPS Conference 2021 Conference Paper

Towards Unifying Behavioral and Response Diversity for Open-ended Learning in Zero-sum Games

  • Xiangyu Liu
  • Hangtian Jia
  • Ying Wen
  • Yujing Hu
  • Yingfeng Chen
  • Changjie Fan
  • Zhipeng Hu
  • Yaodong Yang

Measuring and promoting policy diversity is critical for solving games with strong non-transitive dynamics where strategic cycles exist and there is no consistent winner (e.g., Rock-Paper-Scissors). With that in mind, maintaining a pool of diverse policies via open-ended learning is an attractive solution, which can generate auto-curricula to avoid being exploited. However, in conventional open-ended learning algorithms, there are no widely accepted definitions for diversity, making it hard to construct and evaluate diverse policies. In this work, we summarize previous concepts of diversity and work towards offering a unified measure of diversity in multi-agent open-ended learning that includes all elements in Markov games, based on both Behavioral Diversity (BD) and Response Diversity (RD). At the trajectory distribution level, we re-define BD in the state-action space as the discrepancies of occupancy measures. For the reward dynamics, we propose RD to characterize diversity through the responses of policies when encountering different opponents. We also show that many current diversity measures fall into one of the categories of BD or RD but not both. With this unified diversity measure, we design the corresponding diversity-promoting objective and population effectivity when seeking the best responses in open-ended learning. We validate our methods in relatively simple games like the matrix game and the non-transitive mixture model, as well as in the complex Google Research Football environment. The population found by our methods reveals the lowest exploitability and highest population effectivity in the matrix game and non-transitive mixture model, as well as the largest goal difference when interacting with opponents of various levels in Google Research Football.

AAAI Conference 2021 Conference Paper

Write-a-speaker: Text-based Emotional and Rhythmic Talking-head Generation

  • Lincheng Li
  • Suzhen Wang
  • Zhimeng Zhang
  • Yu Ding
  • Yixing Zheng
  • Xin Yu
  • Changjie Fan

In this paper, we propose a novel text-based talking-head video generation framework that synthesizes high-fidelity facial expressions and head motions in accordance with contextual sentiments as well as speech rhythm and pauses. To be specific, our framework consists of a speaker-independent stage and a speaker-specific stage. In the speaker-independent stage, we design three parallel networks to generate animation parameters of the mouth, upper face, and head from texts separately. In the speaker-specific stage, we present a 3D face model guided attention network to synthesize videos tailored for different individuals. It takes the animation parameters as input and exploits an attention mask to manipulate facial expression changes for the input individuals. Furthermore, to better establish authentic correspondences between visual motions (i.e., facial expression changes and head movements) and audios, we leverage a high-accuracy motion capture dataset instead of relying on long videos of specific individuals. After attaining the visual and audio correspondences, we can effectively train our network in an end-to-end fashion. Extensive qualitative and quantitative experiments demonstrate that our algorithm achieves high-quality photorealistic talking-head videos, including various facial expressions and head motions according to speech rhythms, and outperforms the state-of-the-art.

ICLR Conference 2020 Conference Paper

Action Semantics Network: Considering the Effects of Actions in Multiagent Systems

  • Weixun Wang
  • Tianpei Yang
  • Yong Liu 0007
  • Jianye Hao
  • Xiaotian Hao
  • Yujing Hu
  • Yingfeng Chen
  • Changjie Fan

In multiagent systems (MASs), each agent makes individual decisions, but all of them jointly drive the system's evolution. Learning in MASs is difficult since each agent's selection of actions must take place in the presence of other co-learning agents. Moreover, environmental stochasticity and uncertainty increase exponentially with the number of agents. Previous works borrow various multiagent coordination mechanisms into deep learning architectures to facilitate coordination. However, none of them explicitly considers action semantics between agents, i.e., that different actions have different influences on other agents. In this paper, we propose a novel network architecture, named Action Semantics Network (ASN), that explicitly represents such action semantics between agents. ASN characterizes the influence of different actions on other agents using neural networks based on the action semantics between them. ASN can be easily combined with existing deep reinforcement learning (DRL) algorithms to boost their performance. Experimental results on StarCraft II micromanagement and Neural MMO show that ASN significantly improves the performance of state-of-the-art DRL approaches compared with several baseline network architectures.
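
A hypothetical sketch of the ASN layout follows; the module names and sizes are assumptions rather than the paper's code. Q-values for environment-affecting actions come from the agent's own encoding, while the Q-value for an action directed at agent j also conditions on j's features:

```python
# Hypothetical ASN-style layout: per-opponent heads for agent-directed actions.
import torch
import torch.nn as nn

class ActionSemanticsNet(nn.Module):
    def __init__(self, obs_dim, other_dim, n_self_actions, n_others, hidden=64):
        super().__init__()
        self.self_enc = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.self_head = nn.Linear(hidden, n_self_actions)
        # One pairwise module per other agent, scoring the action aimed at it.
        self.pair_heads = nn.ModuleList(
            nn.Sequential(nn.Linear(obs_dim + other_dim, hidden), nn.ReLU(),
                          nn.Linear(hidden, 1))
            for _ in range(n_others)
        )

    def forward(self, obs, others):  # others: (batch, n_others, other_dim)
        h = self.self_enc(obs)
        q_self = self.self_head(h)                       # env-affecting actions
        q_pair = torch.cat([head(torch.cat([obs, others[:, j]], dim=-1))
                            for j, head in enumerate(self.pair_heads)], dim=-1)
        return torch.cat([q_self, q_pair], dim=-1)       # full Q-vector

q = ActionSemanticsNet(obs_dim=16, other_dim=8, n_self_actions=4, n_others=2)
print(q(torch.randn(5, 16), torch.randn(5, 2, 8)).shape)  # (5, 6)
```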

IJCAI Conference 2020 Conference Paper

Efficient Deep Reinforcement Learning via Adaptive Policy Transfer

  • Tianpei Yang
  • Jianye Hao
  • Zhaopeng Meng
  • Zongzhang Zhang
  • Yujing Hu
  • Yingfeng Chen
  • Changjie Fan
  • Weixun Wang

Transfer learning has shown great potential to accelerate reinforcement learning (RL) by leveraging prior knowledge from past policies learned on relevant tasks. Existing approaches either transfer previous knowledge by explicitly computing similarities between tasks or select appropriate source policies to provide guided exploration. However, a method that directly optimizes the target policy by selectively reusing knowledge from appropriate source policies, without explicitly measuring task similarity, has been missing. In this paper, we propose a novel Policy Transfer Framework (PTF) that takes advantage of this idea. PTF learns when and which source policy is best to reuse for the target policy, and when to terminate it, by modeling multi-policy transfer as an option learning problem. PTF can be easily combined with existing DRL methods, and experimental results show that it significantly accelerates RL and surpasses state-of-the-art policy transfer methods in terms of learning efficiency and final performance in both discrete and continuous action spaces.
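
The option view can be made concrete with a small rollout sketch. Everything below (the environment interface, the option_value and terminate_prob callables) is an illustrative assumption, not the paper's API:

```python
# Illustrative sketch only: each source policy is an option with a learned
# termination probability, so the agent picks which source to follow and
# when to stop following it.
import random

def ptf_rollout(env, source_policies, option_value, terminate_prob, steps=1000):
    """option_value(state, k) scores reusing source k; terminate_prob(state, k)
    is the learned chance of ending the current option."""
    state, option = env.reset(), None
    for _ in range(steps):
        if option is None or random.random() < terminate_prob(state, option):
            option = max(range(len(source_policies)),
                         key=lambda k: option_value(state, k))
        # The target policy is optimized elsewhere; the chosen source only
        # guides exploration (e.g., via an imitation term in the loss).
        action = source_policies[option](state)
        state, reward, done, _ = env.step(action)
        if done:
            state, option = env.reset(), None
```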

JAAMAS Journal 2020 Journal Article

Efficient policy detecting and reusing for non-stationarity in Markov games

  • Yan Zheng
  • Jianye Hao
  • Changjie Fan

One challenging problem in multiagent systems is to cooperate or compete with non-stationary agents that change behavior from time to time. An agent in such a non-stationary environment is usually expected to quickly detect the other agents' policy during online interaction and then adapt its own policy accordingly. This article studies efficient policy detection and reuse techniques when playing against non-stationary agents in cooperative or competitive Markov games. We propose a new deep Bayesian policy reuse algorithm, a.k.a. DPN-BPR+, by extending the recent BPR+ algorithm with a neural network as the value-function approximator. To detect policies accurately, we propose the rectified belief model, which takes advantage of the opponent model to infer the other agents' policy from reward signals and its behavior. Instead of directly storing individual policies as BPR+ does, we introduce a distilled policy network that serves as the policy library and use policy distillation to achieve efficient online policy learning and reuse. DPN-BPR+ inherits all the advantages of BPR+. In experiments, we evaluate DPN-BPR+ in terms of detection accuracy, cumulative reward, and speed of convergence in four complex Markov games with raw visual inputs, including two cooperative games and two competitive games. Empirical results show that our proposed DPN-BPR+ approach outperforms existing algorithms in all these Markov games.
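
The detection step rests on a Bayesian belief update over candidate opponent policies. The numeric sketch below shows only the reward-signal part of that update (the rectified variant additionally conditions on observed opponent behavior); the Gaussian reward models are toy assumptions:

```python
# Minimal numeric sketch of a BPR-style Bayesian belief update from rewards.
import numpy as np
from scipy.stats import norm

def update_belief(belief, reward, reward_likelihood):
    """belief: prior over opponent policies; reward_likelihood[k](r) is the
    probability density of observing reward r against opponent policy k."""
    likelihoods = np.array([f(reward) for f in reward_likelihood])
    posterior = belief * likelihoods
    return posterior / posterior.sum()

# Toy example: two candidate opponent policies with Gaussian reward models.
models = [norm(loc=1.0, scale=0.5).pdf, norm(loc=-1.0, scale=0.5).pdf]
belief = np.array([0.5, 0.5])
belief = update_belief(belief, reward=0.8, reward_likelihood=models)
print(belief)  # mass shifts toward the first policy
```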

AAAI Conference 2020 Conference Paper

Fast and Robust Face-to-Parameter Translation for Game Character Auto-Creation

  • Tianyang Shi
  • Zhengxia Zuo
  • Yi Yuan
  • Changjie Fan

With the rapid development of role-playing games (RPGs), players are now allowed to edit the facial appearance of their in-game characters to their preferences rather than using default templates. This paper proposes a game character auto-creation framework that generates in-game characters according to a player's input face photo. Unlike previous methods designed around neural style transfer or monocular 3D face reconstruction, we re-formulate the character auto-creation process from a different point of view: predicting a large set of physically meaningful facial parameters under a self-supervised learning paradigm. Instead of updating facial parameters iteratively at the input end of the renderer as suggested by previous methods, which is time-consuming, we introduce a facial parameter translator so that creation can be done efficiently through a single forward propagation from the face embeddings to the parameters, with a considerable 1000x computational speedup. Despite its high efficiency, interactivity is preserved: users may optionally fine-tune the facial parameters of the created character according to their needs. Our approach also shows better robustness than previous methods, especially for photos with head-pose variation. Comparison results and ablation analysis on seven public face verification datasets demonstrate the effectiveness of our method.
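
A translator of this kind can be as simple as a small feed-forward network. The sketch below is a hypothetical stand-in (the embedding size, parameter count, and bounded-slider assumption are all illustrative):

```python
# Hypothetical sketch: face embedding -> facial parameters in one forward pass,
# replacing per-image iterative optimization at the renderer input.
import torch
import torch.nn as nn

translator = nn.Sequential(
    nn.Linear(512, 512), nn.ReLU(),    # 512-d face-recognition embedding assumed
    nn.Linear(512, 256), nn.ReLU(),
    nn.Linear(256, 200), nn.Sigmoid()  # e.g., 200 bounded slider parameters
)

embedding = torch.randn(1, 512)        # from a pretrained face recognizer
params = translator(embedding)         # single forward pass, no iteration
# Optional interactive fine-tuning: users may still nudge individual sliders.
params = params.detach().clone()
params[0, 42] = 0.9
```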

AAAI Conference 2020 Conference Paper

From Few to More: Large-Scale Dynamic Multiagent Curriculum Learning

  • Weixun Wang
  • Tianpei Yang
  • Yong Liu
  • Jianye Hao
  • Xiaotian Hao
  • Yujing Hu
  • Yingfeng Chen
  • Changjie Fan

Much effort has been devoted to investigating how agents can learn effectively and achieve coordination in multiagent systems. However, large-scale multiagent settings remain challenging due to the complex dynamics between the environment and the agents and the explosion of the state-action space. In this paper, we design a novel Dynamic Multiagent Curriculum Learning (DyMA-CL) approach that solves large-scale problems by starting from a small multiagent scenario and progressively increasing the number of agents. We propose three transfer mechanisms across curricula to accelerate the learning process. Moreover, because the state dimension varies across curricula, existing network structures cannot be applied in such a transfer setting since their input sizes are fixed. We therefore design a novel network structure called Dynamic Agent-number Network (DyAN) to handle the varying size of the network input. Experimental results show that DyMA-CL using DyAN greatly improves the performance of large-scale multiagent learning compared with state-of-the-art deep reinforcement learning approaches. We also investigate the influence of the three transfer mechanisms across curricula through extensive simulations.
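
The key trick in DyAN is a permutation-invariant aggregation over per-agent encodings, so the same weights accept any number of agents. A minimal sketch, with all sizes assumed:

```python
# Minimal sketch of the DyAN idea: encode each observed agent separately, then
# aggregate with a permutation-invariant sum so the same network accepts any
# number of agents across curricula.
import torch
import torch.nn as nn

class DynamicAgentNet(nn.Module):
    def __init__(self, self_dim=10, agent_dim=6, hidden=32, n_actions=5):
        super().__init__()
        self.agent_enc = nn.Sequential(nn.Linear(agent_dim, hidden), nn.ReLU())
        self.q_head = nn.Sequential(nn.Linear(self_dim + hidden, hidden),
                                    nn.ReLU(), nn.Linear(hidden, n_actions))

    def forward(self, self_obs, agent_obs):  # agent_obs: (batch, n_agents, agent_dim)
        pooled = self.agent_enc(agent_obs).sum(dim=1)  # invariant to agent count
        return self.q_head(torch.cat([self_obs, pooled], dim=-1))

net = DynamicAgentNet()
print(net(torch.randn(4, 10), torch.randn(4, 3, 6)).shape)   # 3 agents -> (4, 5)
print(net(torch.randn(4, 10), torch.randn(4, 11, 6)).shape)  # 11 agents, same net
```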

IJCAI Conference 2020 Conference Paper

Generating Behavior-Diverse Game AIs with Evolutionary Multi-Objective Deep Reinforcement Learning

  • Ruimin Shen
  • Yan Zheng
  • Jianye Hao
  • Zhaopeng Meng
  • Yingfeng Chen
  • Changjie Fan
  • Yang Liu

Generating diverse behaviors for game artificial intelligence (Game AI) has long been recognized as a challenging task in the game industry. Designing a Game AI with a satisfying behavioral characteristic (style) depends heavily on domain knowledge and is hard to achieve manually. Deep reinforcement learning sheds light on advancing automatic Game AI design. However, most existing approaches focus on creating a superhuman Game AI, ignoring the importance of behavioral diversity in games. To bridge the gap, we introduce a new framework, named EMOGI, which can automatically generate desirable styles with almost no domain knowledge. More importantly, EMOGI succeeds in creating a range of diverse styles, providing behavior-diverse Game AIs. Evaluations on Atari and real commercial games indicate that, compared to existing algorithms, EMOGI performs better at generating diverse behaviors and significantly improves the efficiency of Game AI design.

AAAI Conference 2020 Short Paper

Generative Adversarial Imitation Learning from Failed Experiences (Student Abstract)

  • Jiacheng Zhu
  • Jiahao Lin
  • Meng Wang
  • Yingfeng Chen
  • Changjie Fan
  • Chong Jiang
  • Zongzhang Zhang

Imitation learning provides a family of promising methods that learn policies directly from expert demonstrations. As a model-free, online imitation learning method, generative adversarial imitation learning (GAIL) generalizes well to unseen situations and can handle complex problems. In this paper, we propose a novel variant of GAIL called GAIL from failed experiences (GAILFE). GAILFE allows an agent to utilize failed experiences during training. Moreover, a constrained optimization objective is formalized in GAILFE to balance learning from the given demonstrations against learning from self-generated failed experiences. Empirically, compared with GAIL, GAILFE improves sample efficiency and learning speed across different tasks.
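
One plausible reading of the discriminator side is that failed experiences simply join the agent's rollouts as extra negatives, with a weight standing in for the constraint. The sketch below is that reading, not the paper's implementation; the loss weight lam and all shapes are assumptions:

```python
# Hedged sketch: failed experiences as weighted extra negatives for a GAIL-style
# discriminator over (state, action) pairs.
import torch
import torch.nn as nn

disc = nn.Sequential(nn.Linear(6, 64), nn.ReLU(), nn.Linear(64, 1))

def discriminator_loss(expert_sa, agent_sa, failed_sa, lam=0.5):
    """Expert pairs are positives; the agent's own rollouts and stored failed
    experiences are negatives, the latter weighted by lam."""
    bce = nn.BCEWithLogitsLoss()
    return (bce(disc(expert_sa), torch.ones(len(expert_sa), 1))
            + bce(disc(agent_sa), torch.zeros(len(agent_sa), 1))
            + lam * bce(disc(failed_sa), torch.zeros(len(failed_sa), 1)))

loss = discriminator_loss(torch.randn(32, 6), torch.randn(32, 6), torch.randn(32, 6))
loss.backward()
```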

NeurIPS Conference 2020 Conference Paper

Learning to Utilize Shaping Rewards: A New Approach of Reward Shaping

  • Yujing Hu
  • Weixun Wang
  • Hangtian Jia
  • Yixiang Wang
  • Yingfeng Chen
  • Jianye Hao
  • Feng Wu
  • Changjie Fan

Reward shaping is an effective technique for incorporating domain knowledge into reinforcement learning (RL). Existing approaches, such as potential-based reward shaping, normally make full use of a given shaping reward function. However, since the transformation of human knowledge into numeric reward values is often imperfect, for reasons such as human cognitive bias, fully utilizing the shaping reward function may fail to improve the performance of RL algorithms. In this paper, we consider the problem of adaptively utilizing a given shaping reward function. We formulate the utilization of shaping rewards as a bi-level optimization problem, where the lower level optimizes the policy using the shaping rewards and the upper level optimizes a parameterized shaping weight function for true reward maximization. We formally derive the gradient of the expected true reward with respect to the shaping weight function parameters and accordingly propose three learning algorithms based on different assumptions. Experiments in sparse-reward cartpole and MuJoCo environments show that our algorithms can fully exploit beneficial shaping rewards while ignoring unbeneficial shaping rewards or even transforming them into beneficial ones.
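
The lower level of the bi-level formulation reduces to training on an adaptively weighted reward. In the sketch below, z_phi and the shaping-reward values are illustrative names and stand-ins, not the paper's API:

```python
# Sketch of the lower-level reward under stated assumptions: the policy trains
# on the environment reward plus an adaptively weighted shaping term.
import torch
import torch.nn as nn

z_phi = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 1))  # weight net

def shaped_reward(state, env_reward, shaping_reward):
    """Lower level: r_total = r_env + z_phi(s) * f(s); the upper level adjusts
    phi by the gradient of the *true* expected return, so harmful shaping can
    be weighted toward zero or even negated."""
    return env_reward + z_phi(state).squeeze(-1) * shaping_reward

s = torch.randn(8, 4)  # e.g., a batch of cartpole states
print(shaped_reward(s, torch.ones(8), torch.full((8,), 0.3)).shape)  # (8,)
```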

ICML Conference 2020 Conference Paper

Q-value Path Decomposition for Deep Multiagent Reinforcement Learning

  • Yaodong Yang 0002
  • Jianye Hao
  • Guangyong Chen
  • Hongyao Tang
  • Yingfeng Chen
  • Yujing Hu
  • Changjie Fan
  • Zhongyu Wei

Recently, deep multiagent reinforcement learning (MARL) has become a highly active research area, as many real-world problems can inherently be viewed as multiagent systems. A particularly interesting and widely applicable class of problems is the partially observable cooperative multiagent setting, in which a team of agents learns to coordinate their behaviors conditioned on their private observations and a commonly shared global reward signal. One natural solution is the centralized-training-with-decentralized-execution paradigm; during centralized training, a key challenge is multiagent credit assignment: how to allocate the global rewards to individual agent policies for better coordination toward maximizing system-level benefits. In this paper, we propose a new method called Q-value Path Decomposition (QPD) to decompose the system's global Q-values into individual agents' Q-values. Unlike previous works, which restrict the representational relation between the individual Q-values and the global one, we bring the integrated gradients attribution technique into deep MARL to directly decompose global Q-values along trajectory paths and assign credits to agents. We evaluate QPD on the challenging StarCraft II micromanagement tasks and show that it achieves state-of-the-art performance in both homogeneous and heterogeneous multiagent scenarios compared with existing cooperative MARL algorithms.
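
Integrated gradients itself is generic and easy to sketch: attribute a scalar output to input features by integrating gradients along a straight path from a baseline. Applied per agent-feature block, the attribution mass yields a credit split; the toy Q-function below is an assumption, not the paper's model:

```python
# Generic integrated-gradients sketch (not the authors' implementation).
import torch

def integrated_gradients(q_fn, x, baseline=None, steps=64):
    baseline = torch.zeros_like(x) if baseline is None else baseline
    total = torch.zeros_like(x)
    for k in range(1, steps + 1):
        point = (baseline + (k / steps) * (x - baseline)).requires_grad_(True)
        q_fn(point).sum().backward()
        total += point.grad
    return (x - baseline) * total / steps  # completeness: sums to ~ Q(x) - Q(base)

# Toy global Q over concatenated per-agent features; per-agent credit is the
# attribution mass on that agent's slice.
q_fn = lambda z: (z ** 2).sum(dim=-1)
x = torch.randn(1, 6)               # two agents, three features each (assumed)
attr = integrated_gradients(q_fn, x)
print(attr.view(2, 3).sum(dim=-1))  # credit per agent
```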

IJCAI Conference 2019 Conference Paper

Deep Multi-Agent Reinforcement Learning with Discrete-Continuous Hybrid Action Spaces

  • Haotian Fu
  • Hongyao Tang
  • Jianye Hao
  • Zihan Lei
  • Yingfeng Chen
  • Changjie Fan

Deep Reinforcement Learning (DRL) has been applied to a variety of cooperative multi-agent problems with either discrete action spaces or continuous action spaces. However, to the best of our knowledge, no previous work has succeeded in applying DRL to multi-agent problems with discrete-continuous hybrid (or parameterized) action spaces, which are very common in practice. Our work fills this gap by proposing two novel algorithms: Deep Multi-Agent Parameterized Q-Networks (Deep MAPQN) and Deep Multi-Agent Hierarchical Hybrid Q-Networks (Deep MAHHQN). We follow the centralized-training-but-decentralized-execution paradigm: different levels of communication between agents are used to facilitate training, while each agent executes its policy independently based on local observations. Our empirical results on several challenging tasks (simulated RoboCup Soccer and the game Ghost Story) show that both Deep MAPQN and Deep MAHHQN are effective and significantly outperform the existing independent deep parameterized Q-learning method.
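
For intuition, the single-agent skeleton of a hybrid-action Q-network (a P-DQN-style layout; the paper's multi-agent coordination machinery is omitted) looks roughly like this, with all sizes assumed:

```python
# Minimal single-agent sketch of a hybrid-action Q-network: a parameter network
# proposes a continuous parameter for every discrete action, and a Q-network
# scores each (discrete action, its parameter) pair.
import torch
import torch.nn as nn

class HybridQ(nn.Module):
    def __init__(self, state_dim=8, n_discrete=3, param_dim=2, hidden=64):
        super().__init__()
        self.param_net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                       nn.Linear(hidden, n_discrete * param_dim),
                                       nn.Tanh())
        self.q_net = nn.Sequential(
            nn.Linear(state_dim + n_discrete * param_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_discrete))
        self.shape = (n_discrete, param_dim)

    def forward(self, s):
        params = self.param_net(s)                      # all actions' parameters
        q = self.q_net(torch.cat([s, params], dim=-1))  # one Q per discrete action
        return q, params.view(-1, *self.shape)

net = HybridQ()
q, params = net(torch.randn(5, 8))
a = q.argmax(dim=-1)                 # chosen discrete action
chosen = params[torch.arange(5), a]  # its continuous parameter
print(a.shape, chosen.shape)         # (5,) (5, 2)
```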

IJCAI Conference 2019 Conference Paper

Explicitly Coordinated Policy Iteration

  • Yujing Hu
  • Yingfeng Chen
  • Changjie Fan
  • Jianye Hao

Coordination on an optimal policy between independent learners in fully cooperative stochastic games is difficult due to problems such as relative overgeneralization and miscoordination. Most state-of-the-art algorithms apply fusion heuristics to agents' optimistic and average rewards, by which coordination between agents can be achieved implicitly. However, such implicit coordination faces practical issues such as tedious parameter tuning in real-world applications. The lack of an explicit coordination mechanism may also lead to a low likelihood of coordination in problems with multiple optimal policies. Based on the necessary conditions of an optimal policy, we propose the explicitly coordinated policy iteration (EXCEL) algorithm, which always forces agents to coordinate by comparing the agents' separate optimistic and average value functions. We also propose three solutions for deep reinforcement learning extensions of EXCEL. Extensive experiments in matrix games (from 2-agent 2-action games to 5-agent 20-action games) and stochastic games (from 2-agent to 5-agent games) show that EXCEL outperforms state-of-the-art algorithms, with faster convergence and better coordination.
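
A toy version of the explicit coordination test might look as follows; the certification rule and tolerance are illustrative assumptions, not the paper's formulation:

```python
# Toy sketch: keep both an optimistic value (best payoff seen per action) and an
# average value, and only commit to actions where the two roughly agree.
opt = {"a": 10.0, "b": 7.0}  # optimistic values per action
avg = {"a": 2.0, "b": 6.5}   # average values (miscoordination drags "a" down)

def coordinated_action(opt, avg, tol=1.0):
    """Prefer actions whose average return certifies the optimistic estimate;
    this guards against relative overgeneralization toward fragile optima."""
    certified = [a for a in opt if opt[a] - avg[a] <= tol]
    pool = certified or list(opt)
    return max(pool, key=lambda a: opt[a])

print(coordinated_action(opt, avg))  # "b": the optimistic "a" is not certified
```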

IJCAI Conference 2019 Conference Paper

Reinforcement Learning Experience Reuse with Policy Residual Representation

  • WenJi Zhou
  • Yang Yu
  • Yingfeng Chen
  • Kai Guan
  • Tangjie Lv
  • Changjie Fan
  • Zhi-Hua Zhou

Experience reuse is key to sample-efficient reinforcement learning. One of the critical issues is how the experience is represented and stored. Previously, experience has been stored in the form of features, individual models, or an average model, each lying at a different granularity. However, new tasks may require experience across multiple granularities. In this paper, we propose the policy residual representation (PRR) network, which can extract and store multiple levels of experience. The PRR network is trained on a set of tasks with a multi-level architecture, where a module at each level corresponds to a subset of the tasks. Therefore, the PRR network represents the experience in a spectrum-like way. When training on a new task, PRR can provide different levels of experience to accelerate learning. We experiment with the PRR network on a set of grid-world navigation tasks, locomotion tasks, and fighting tasks in a video game. The results show that the PRR network leads to better reuse of experience and thus outperforms several state-of-the-art approaches.
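
The spectrum structure can be pictured as a sum of residual modules of increasing specificity. A minimal sketch, with the linear modules and the depth argument as assumptions:

```python
# Sketch of a spectrum-like residual policy: logits are the sum of level
# outputs, from a module shared by all tasks down to task-specific ones.
import torch
import torch.nn as nn

class PRRSketch(nn.Module):
    def __init__(self, obs_dim=8, n_actions=4, levels=3):
        super().__init__()
        self.levels = nn.ModuleList(nn.Linear(obs_dim, n_actions)
                                    for _ in range(levels))

    def forward(self, obs, depth=None):
        depth = len(self.levels) if depth is None else depth
        return sum(m(obs) for m in self.levels[:depth])

net = PRRSketch()
coarse = net(torch.randn(2, 8), depth=1)  # reuse only the most general experience
fine = net(torch.randn(2, 8))             # full task-specific policy
```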

IJCAI Conference 2019 Conference Paper

Value Function Transfer for Deep Multi-Agent Reinforcement Learning Based on N-Step Returns

  • Yong Liu
  • Yujing Hu
  • Yang Gao
  • Yingfeng Chen
  • Changjie Fan

Many real-world problems, such as robot control and soccer games, are naturally modeled as sparse-interaction multi-agent systems. Reusing single-agent knowledge in multi-agent systems with sparse interactions can greatly accelerate the multi-agent learning process. Previous works rely on the bisimulation metric to define Markov decision process (MDP) similarity for controlling knowledge transfer. However, the bisimulation metric is costly to compute and is not suitable for high-dimensional state spaces. In this work, we propose more scalable transfer learning methods based on a novel MDP similarity concept. We start by defining MDP similarity based on the N-step return (NSR) values of an MDP. We then propose two knowledge transfer methods based on deep neural networks, called direct value function transfer and NSR-based value function transfer. We conduct experiments in an image-based grid world, the multi-agent particle environment (MPE), and the Ms. Pac-Man game. The results indicate that the proposed methods can significantly accelerate multi-agent reinforcement learning while achieving better asymptotic performance.
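
NSR-based similarity is cheap to prototype: compute N-step returns from sampled rewards and compare them across tasks. The particular distance below (mean absolute NSR gap over matched steps) is an illustrative assumption:

```python
# Sketch: compare two tasks by the discrepancy of their N-step returns.
import numpy as np

def n_step_returns(rewards, gamma=0.99, n=5):
    """NSR at every step t of one episode: sum_{k<n} gamma^k * r_{t+k}."""
    discounts = gamma ** np.arange(n)
    return np.array([rewards[t:t + n] @ discounts[:len(rewards) - t]
                     for t in range(len(rewards))])

def mdp_distance(rewards_a, rewards_b, **kw):
    """Cheap proxy for MDP similarity: mean absolute NSR gap; unlike a
    bisimulation metric, it needs only sampled rewards."""
    ra, rb = n_step_returns(rewards_a, **kw), n_step_returns(rewards_b, **kw)
    m = min(len(ra), len(rb))
    return np.abs(ra[:m] - rb[:m]).mean()

print(mdp_distance(np.ones(20), np.full(20, 0.8)))  # small gap -> similar tasks
```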

NeurIPS Conference 2018 Conference Paper

A Deep Bayesian Policy Reuse Approach Against Non-Stationary Agents

  • Yan Zheng
  • Zhaopeng Meng
  • Jianye Hao
  • Zongzhang Zhang
  • Tianpei Yang
  • Changjie Fan

In multiagent domains, coping with non-stationary agents that change behaviors from time to time is a challenging problem: an agent is usually required to quickly detect the other agent's policy during online interaction and then adapt its own policy accordingly. This paper studies efficient policy detection and reuse techniques when playing against non-stationary agents in Markov games. We propose a new deep BPR+ algorithm by extending the recent BPR+ algorithm with a neural network as the value-function approximator. To detect policies accurately, we propose the rectified belief model, which takes advantage of the opponent model to infer the other agent's policy from reward signals and its behaviors. Instead of directly storing individual policies as BPR+ does, we introduce a distilled policy network that serves as the policy library in BPR+, using policy distillation to achieve efficient online policy learning and reuse. Deep BPR+ inherits all the advantages of BPR+ and empirically shows better performance in terms of detection accuracy, cumulative rewards, and speed of convergence compared to existing algorithms in complex Markov games with raw visual inputs.

IJCAI Conference 2018 Conference Paper

Recurrent Deep Multiagent Q-Learning for Autonomous Brokers in Smart Grid

  • Yaodong Yang
  • Jianye Hao
  • Mingyang Sun
  • Zan Wang
  • Changjie Fan
  • Goran Strbac

The broker mechanism is widely applied to serve interested parties in deriving long-term policies that reduce costs or increase profits in the smart grid. However, a broker faces a number of challenging problems, such as balancing demand and supply from customers and competing with other coexisting brokers to maximize its profit. In this paper, we develop an effective pricing strategy for brokers in the local electricity retail market based on recurrent deep multiagent reinforcement learning and sequential clustering. We use real household electricity consumption data to simulate the retail market for evaluating our strategy. The experiments demonstrate the superior performance of the proposed pricing strategy and highlight the effectiveness of our reward shaping mechanism.