Arrow Research search

Author name cluster

Wulong Liu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

23 papers
2 author rows

Possible papers (23)

NeurIPS Conference 2025 Conference Paper

AttentionPredictor: Temporal Patterns Matter for KV Cache Compression

  • Qingyue Yang
  • Jie Wang
  • Xing Li
  • Zhihai Wang
  • Chen Chen
  • Lei Chen
  • Xianzhi Yu
  • Wulong Liu

With the development of large language models (LLMs), efficient inference through Key-Value (KV) cache compression has attracted considerable attention, especially for long-context generation. To compress the KV cache, recent methods identify critical KV tokens through static modeling of attention scores. However, these methods often struggle to accurately determine critical tokens as they neglect the *temporal patterns* in attention scores, resulting in a noticeable degradation in LLM performance. To address this challenge, we propose **AttentionPredictor**, which is the **first learning-based method to directly predict attention patterns for KV cache compression and critical token identification**. Specifically, AttentionPredictor learns a lightweight, unified convolution model to dynamically capture spatiotemporal patterns and predict the next-token attention scores. An appealing feature of AttentionPredictor is that it accurately predicts the attention score while sharing the unified prediction model, which consumes negligible memory, among all transformer layers. Moreover, we propose a cross-token critical cache prefetching framework that hides the token estimation time overhead to accelerate the decoding stage. By retaining most of the attention information, AttentionPredictor achieves **13$\times$** KV cache compression and a **5.6$\times$** speedup in a cache offloading scenario with comparable LLM performance, significantly outperforming state-of-the-art methods. The code is available at https://github.com/MIRALab-USTC/LLM-AttentionPredictor.
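
A minimal sketch of the core idea, for orientation: a tiny convolutional model reads the last few steps of attention scores and predicts the next step's scores, from which the top-k "critical" KV tokens are kept. The shapes, hyperparameters, and the model below are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class ToyAttentionPredictor(nn.Module):
    """Predict next-step attention scores from a short temporal history."""

    def __init__(self, history: int = 8):
        super().__init__()
        # Treat the [history, seq_len] score matrix as a 1-channel image and
        # collapse the temporal axis with a small convolution.
        self.conv = nn.Conv2d(1, 1, kernel_size=(history, 5), padding=(0, 2))

    def forward(self, scores: torch.Tensor) -> torch.Tensor:
        # scores: [batch, history, seq_len] past attention scores
        pred = self.conv(scores.unsqueeze(1)).squeeze(1).squeeze(1)
        return torch.softmax(pred, dim=-1)   # predicted next-step scores

def select_critical_tokens(pred_scores: torch.Tensor, k: int) -> torch.Tensor:
    # Keep the k tokens predicted to receive the most attention.
    return pred_scores.topk(k, dim=-1).indices

history = torch.rand(1, 8, 128)              # 8 past steps, 128 cached tokens
predictor = ToyAttentionPredictor()
keep = select_critical_tokens(predictor(history), k=16)
print(keep.shape)                            # indices of KV entries to retain
```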

ICML Conference 2025 Conference Paper

FlatQuant: Flatness Matters for LLM Quantization

  • Yuxuan Sun
  • Ruikang Liu
  • Haoli Bai
  • Han Bao
  • Kang Zhao
  • Yuening Li
  • Jiaxin Hu
  • Xianzhi Yu

Recently, quantization has been widely used for the compression and acceleration of large language models (LLMs). Due to the outliers in LLMs, it is crucial to flatten weights and activations to minimize quantization error with equally spaced quantization points. Prior research explores various pre-quantization transformations to suppress outliers, such as per-channel scaling and the Hadamard transformation. However, we observe that these transformed weights and activations can still exhibit steep and dispersed distributions. In this paper, we propose FlatQuant (Fast and Learnable Affine Transformation), a new post-training quantization approach that enhances the flatness of weights and activations. Our approach identifies optimal affine transformations for each linear layer, calibrated in hours via a lightweight objective. To reduce the runtime overhead of the affine transformations, we construct each one as a Kronecker product of two lightweight matrices and fuse all operations in FlatQuant into a single kernel. Extensive experiments demonstrate that FlatQuant establishes a new state-of-the-art benchmark for quantization. For example, it achieves less than 1% accuracy drop for W4A4 quantization on the LLaMA-3-70B model, surpassing SpinQuant by 7.5%. Additionally, it provides up to 2.3x prefill speedup and 1.7x decoding speedup compared to the FP16 model. Code is available at: https://github.com/ruikangliu/FlatQuant.
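
The Kronecker trick mentioned above can be sketched concisely: a full $d \times d$ transform is replaced by A ($a \times a$) and B ($b \times b$) with $ab = d$, so applying it costs two small matmuls rather than one large one. A minimal sketch under that assumption (the quantizer and sizes below are illustrative, not FlatQuant's):

```python
import numpy as np

def kron_transform(x: np.ndarray, A: np.ndarray, B: np.ndarray) -> np.ndarray:
    # Computes (A kron B) @ x without forming the d x d Kronecker product,
    # using the identity (A kron B) vec(X) = vec(A X B^T) (row-major vec).
    a, b = A.shape[0], B.shape[0]
    X = x.reshape(a, b)
    return (A @ X @ B.T).reshape(-1)

def fake_quant(x: np.ndarray, bits: int = 4) -> np.ndarray:
    # Uniform symmetric "fake" quantization with equally spaced points.
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    return np.round(x / scale) * scale

rng = np.random.default_rng(0)
a, b = 16, 64                           # d = a * b = 1024
A = rng.standard_normal((a, a))
B = rng.standard_normal((b, b))
x = rng.standard_normal(a * b)

flat = kron_transform(x, A, B)          # two small matmuls
dense = np.kron(A, B) @ x               # reference: explicit Kronecker product
assert np.allclose(flat, dense, atol=1e-8)
print(fake_quant(flat)[:4])
```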

ICML Conference 2025 Conference Paper

KVTuner: Sensitivity-Aware Layer-Wise Mixed-Precision KV Cache Quantization for Efficient and Nearly Lossless LLM Inference

  • Xing Li 0023
  • Zeyu Xing 0002
  • Yiming Li
  • Linping Qu
  • Hui-Ling Zhen
  • Yiwu Yao
  • Wulong Liu
  • Sinno Jialin Pan

KV cache quantization can improve Large Language Model (LLM) inference throughput and latency in long-context and large batch-size scenarios while preserving LLM effectiveness. However, current methods have three unsolved issues: overlooking layer-wise sensitivity to KV cache quantization, high overhead of online fine-grained decision-making, and low flexibility to different LLMs and constraints. Therefore, we theoretically analyze the inherent correlation of layer-wise transformer attention patterns to KV cache quantization errors and study why the key cache is generally more important than the value cache for quantization error reduction. We further propose KVTuner, a simple yet effective framework that adaptively searches for optimal hardware-friendly layer-wise KV quantization precision pairs for coarse-grained KV cache with multi-objective optimization, and directly uses the offline-searched configurations during online inference. To reduce the computational cost of offline calibration, we use intra-layer KV precision pair pruning and inter-layer clustering to shrink the search space. Experimental results show that we can achieve nearly lossless 3.25-bit mixed-precision KV cache quantization for LLMs like Llama-3.1-8B-Instruct and 4.0-bit for sensitive models like Qwen2.5-7B-Instruct on mathematical reasoning tasks. The maximum inference throughput can be improved by 21.25% compared with KIVI-KV8 quantization over various context lengths. Our code and searched configurations are available at https://github.com/cmd2001/KVTuner.
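
To make the offline search concrete, here is a hedged toy sketch: per layer, candidate (key-bits, value-bits) pairs are scored with a made-up error model, Pareto-dominated pairs are pruned, and the best feasible pair under a bit budget is picked. The sensitivities, error model, and budget are illustrative assumptions, not KVTuner's calibration.

```python
from itertools import product

CANDIDATE_BITS = [2, 4, 8]

def pareto_prune(pairs):
    # Drop pairs that another pair beats on both error and bit cost.
    return [p for p in pairs
            if not any(q["err"] <= p["err"] and q["cost"] < p["cost"]
                       for q in pairs)]

def search(layer_sensitivity, budget_bits_per_layer):
    config = []
    for sens in layer_sensitivity:
        pairs = [
            # Toy error model: sensitive layers suffer more at low precision,
            # and key bits matter more than value bits (as argued above).
            {"kbits": k, "vbits": v, "cost": k + v,
             "err": sens * (2.0 / k + 1.0 / v)}
            for k, v in product(CANDIDATE_BITS, repeat=2)
        ]
        feasible = [p for p in pareto_prune(pairs)
                    if p["cost"] <= budget_bits_per_layer]
        config.append(min(feasible, key=lambda p: p["err"]))
    return config

# Four layers with different (made-up) sensitivities to KV quantization.
for layer, choice in enumerate(search([0.1, 0.9, 0.3, 0.5], 10)):
    print(layer, choice)
```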

NeurIPS Conference 2024 Conference Paper

Distributional Reinforcement Learning with Regularized Wasserstein Loss

  • Ke Sun
  • Yingnan Zhao
  • Wulong Liu
  • Bei Jiang
  • Linglong Kong

The empirical success of distributional reinforcement learning (RL) relies heavily on the choice of distribution divergence paired with an appropriate distribution representation. In this paper, we propose *Sinkhorn distributional RL (SinkhornDRL)*, which leverages Sinkhorn divergence, a regularized Wasserstein loss, to minimize the difference between current and target Bellman return distributions. Theoretically, we prove the contraction properties of SinkhornDRL, consistent with the interpolation nature of Sinkhorn divergence between the Wasserstein distance and Maximum Mean Discrepancy (MMD). SinkhornDRL enriches the family of distributional RL algorithms, and our investigation of its relationships to existing approaches helps interpret algorithm behaviors. Empirically, we show that SinkhornDRL consistently outperforms or matches existing algorithms on the Atari games suite and particularly stands out in the multi-dimensional reward setting. Code is available at https://github.com/datake/SinkhornDistRL.
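
For reference, the divergence itself is easy to sketch: entropy-regularized optimal transport between two sets of return samples, computed with standard Sinkhorn iterations and then debiased. This is the generic textbook computation under illustrative settings (regularization strength, sample sizes), not the paper's training loop.

```python
import numpy as np

def sinkhorn(x, y, eps=1.0, iters=200):
    # Entropic OT cost between empirical distributions supported on x and y.
    C = (x[:, None] - y[None, :]) ** 2        # squared-distance cost matrix
    K = np.exp(-C / eps)
    a = np.full(len(x), 1 / len(x))           # uniform source marginal
    b = np.full(len(y), 1 / len(y))           # uniform target marginal
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(iters):                    # alternating scaling updates
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]           # optimal transport plan
    return (P * C).sum()

def sinkhorn_divergence(x, y, eps=1.0):
    # Debiased form: interpolates between Wasserstein distance (eps -> 0)
    # and MMD (eps -> infinity), as noted in the abstract.
    return sinkhorn(x, y, eps) - 0.5 * (sinkhorn(x, x, eps)
                                        + sinkhorn(y, y, eps))

rng = np.random.default_rng(0)
current = rng.normal(0.0, 1.0, 64)   # samples from current return distribution
target = rng.normal(0.5, 1.0, 64)    # samples from target Bellman distribution
print(sinkhorn_divergence(current, target))
```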

TMLR Journal 2023 Journal Article

Differentiable Logic Machines

  • Matthieu Zimmer
  • Xuening Feng
  • Claire Glanois
  • Zhaohui Jiang
  • Jianyi Zhang
  • Paul Weng
  • Dong Li
  • Jianye Hao

The integration of reasoning, learning, and decision-making is key to building more general artificial intelligence systems. As a step in this direction, we propose a novel neural-logic architecture, called the differentiable logic machine (DLM), that can solve both inductive logic programming (ILP) and reinforcement learning (RL) problems, where the solution can be interpreted as a first-order logic program. Our approach includes several innovations. Firstly, our architecture defines a restricted but expressive continuous relaxation of the space of first-order logic programs by assigning weights to predicates instead of rules, in contrast to most previous neural-logic approaches. Secondly, with this differentiable architecture, we propose several (supervised and RL) training procedures, based on gradient descent, which can recover a fully interpretable solution (i.e., a logic formula). Thirdly, to accelerate RL training, we also design a novel critic architecture that enables actor-critic algorithms. Fourthly, to solve hard problems, we propose an incremental training procedure that can learn a logic program progressively. Compared to state-of-the-art (SOTA) differentiable ILP methods, DLM successfully solves all the considered ILP problems with a higher percentage of successful seeds (up to 3.5x). On RL problems, without requiring an interpretable solution, DLM outperforms other non-interpretable neural-logic RL approaches in terms of rewards (up to 3.9%). When enforcing interpretability, DLM can solve harder RL problems (e.g., Sorting, Path) than other interpretable RL methods. Moreover, we show that deep logic programs can be learned via incremental supervised training. In addition to this excellent performance, DLM scales well in terms of memory and computational time, especially during the testing phase, where it can handle many more constants (>2x) than SOTA methods.

NeurIPS Conference 2022 Conference Paper

Conformalized Fairness via Quantile Regression

  • Meichen Liu
  • Lei Ding
  • Dengdeng Yu
  • Wulong Liu
  • Linglong Kong
  • Bei Jiang

Algorithmic fairness has received increased attention in socially sensitive domains. While a rich literature on mean fairness has been established, research on quantile fairness remains sparse but vital. To meet this need and advocate for the significance of quantile fairness, we propose a novel framework to learn a real-valued quantile function under the fairness requirement of Demographic Parity with respect to sensitive attributes, such as race or gender, and thereby derive a reliable fair prediction interval. Using optimal transport and functional synchronization techniques, we establish theoretical guarantees of distribution-free coverage and exact fairness for the induced prediction interval constructed from fair quantiles. A hands-on pipeline is provided to incorporate flexible quantile regressions with an efficient fairness adjustment post-processing algorithm. We demonstrate the superior empirical performance of this approach on several benchmark datasets. Our results show the model's ability to uncover the mechanism underlying the fairness-accuracy trade-off in a wide range of societal and medical applications.
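
The conformalization step such a pipeline builds on can be sketched with the standard conformalized quantile regression recipe: fit lower and upper quantile regressors, score a calibration split, and widen the interval by an empirical quantile of the scores. The data, models, and miscoverage level below are illustrative; the paper's fairness adjustment is not reproduced here.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, (600, 1))
y = X[:, 0] ** 2 + rng.normal(0, 0.5, 600)
X_fit, y_fit = X[:400], y[:400]          # fit the quantile regressors
X_cal, y_cal = X[400:], y[400:]          # held-out calibration split

alpha = 0.1                              # target 90% coverage
lo = GradientBoostingRegressor(loss="quantile", alpha=alpha / 2).fit(X_fit, y_fit)
hi = GradientBoostingRegressor(loss="quantile", alpha=1 - alpha / 2).fit(X_fit, y_fit)

# Conformity scores: how far calibration points fall outside the band.
scores = np.maximum(lo.predict(X_cal) - y_cal, y_cal - hi.predict(X_cal))
n = len(scores)
q = np.quantile(scores, np.ceil((1 - alpha) * (n + 1)) / n)

x_new = np.array([[1.0]])
# The adjustment q yields distribution-free ~90% coverage regardless of
# how well the underlying quantile regressors fit.
print("interval:", lo.predict(x_new) - q, hi.predict(x_new) + q)
```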

ICML Conference 2022 Conference Paper

Neuro-Symbolic Hierarchical Rule Induction

  • Claire Glanois
  • Zhao-Hui Jiang
  • Xuening Feng
  • Paul Weng
  • Matthieu Zimmer
  • Dong Li 0016
  • Wulong Liu
  • Jianye Hao

We propose Neuro-Symbolic Hierarchical Rule Induction (HRI), an efficient interpretable neuro-symbolic model, to solve Inductive Logic Programming (ILP) problems. In this model, which is built from a pre-defined set of meta-rules organized in a hierarchical structure, first-order rules are invented by learning embeddings to match facts and body predicates of a meta-rule. To instantiate the model, we design an expressive set of generic meta-rules and demonstrate that they generate a consequent fragment of Horn clauses. As a differentiable model, HRI can be trained both via supervised learning and reinforcement learning. To converge to interpretable rules, we inject controlled noise to avoid local optima and employ an interpretability-regularization term. We empirically validate our model on various tasks (ILP, visual genome, reinforcement learning) against relevant state-of-the-art methods, including traditional ILP methods and neuro-symbolic models.

AAAI Conference 2022 Conference Paper

What about Inputting Policy in Value Function: Policy Representation and Policy-Extended Value Function Approximator

  • Hongyao Tang
  • Zhaopeng Meng
  • Jianye Hao
  • Chen Chen
  • Daniel Graves
  • Dong Li
  • Changmin Yu
  • Hangyu Mao

We study the Policy-extended Value Function Approximator (PeVFA) in Reinforcement Learning (RL), which extends the conventional value function approximator (VFA) to take as input not only the state (and action) but also an explicit policy representation. Such an extension enables PeVFA to preserve the values of multiple policies at the same time and brings an appealing characteristic, i.e., value generalization among policies. We formally analyze the value generalization under Generalized Policy Iteration (GPI). Through theoretical and empirical lenses, we show that generalized value estimates offered by PeVFA may have lower initial approximation error to the true values of successive policies, which is expected to improve consecutive value approximation during GPI. Based on these insights, we introduce a new form of GPI with PeVFA which leverages value generalization along the policy improvement path. Moreover, we propose a representation learning framework for RL policies, providing several approaches to learn effective policy embeddings from policy network parameters or state-action pairs. In our experiments, we evaluate the efficacy of the value generalization offered by PeVFA and policy representation learning in several OpenAI Gym continuous control tasks. For a representative instance of algorithm implementation, Proximal Policy Optimization (PPO) re-implemented under the paradigm of GPI with PeVFA achieves about 40% performance improvement over its vanilla counterpart in most environments.
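
A minimal sketch of the PeVFA idea described above: a value network that conditions on an explicit policy representation in addition to the state. Here the policy embedding is just the policy's flattened parameters passed through a small encoder; the encoder and sizes are illustrative assumptions, and the paper studies richer representations.

```python
import torch
import torch.nn as nn

class PolicyExtendedValue(nn.Module):
    def __init__(self, state_dim: int, policy_param_dim: int, embed_dim: int = 32):
        super().__init__()
        self.policy_encoder = nn.Sequential(
            nn.Linear(policy_param_dim, embed_dim), nn.ReLU())
        self.value_head = nn.Sequential(
            nn.Linear(state_dim + embed_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, state, policy_params):
        # One network represents the values of *multiple* policies, enabling
        # generalization along the policy improvement path.
        z = self.policy_encoder(policy_params)
        return self.value_head(torch.cat([state, z], dim=-1))

policy = nn.Linear(4, 2)                  # a toy policy network
params = torch.cat([p.flatten() for p in policy.parameters()]).unsqueeze(0)
pevfa = PolicyExtendedValue(state_dim=4, policy_param_dim=params.shape[1])
print(pevfa(torch.rand(1, 4), params))    # V(s, pi)
```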

NeurIPS Conference 2021 Conference Paper

Adaptive Online Packing-guided Search for POMDPs

  • Chenyang Wu
  • Guoyu Yang
  • Zongzhang Zhang
  • Yang Yu
  • Dong Li
  • Wulong Liu
  • Jianye Hao

The partially observable Markov decision process (POMDP) provides a general framework for modeling an agent's decision process under state uncertainty, and online planning plays a pivotal role in solving it. A belief is a distribution over states representing state uncertainty. Methods for large-scale POMDP problems rely on the same idea of sampling both states and observations. That is, instead of exact belief updating, a collection of sampled states is used to approximate the belief; instead of considering all possible observations, only a set of sampled observations is considered. Inspired by this, we take one step further and propose an online planning algorithm, Adaptive Online Packing-guided Search (AdaOPS), to better approximate beliefs with an adaptive particle filter technique and balance estimation bias and variance by fusing similar observation branches. Theoretically, our algorithm is guaranteed to find an $\epsilon$-optimal policy with high probability given enough planning time under some mild assumptions. We evaluate our algorithm on several tricky POMDP domains, and it outperforms the state-of-the-art in all of them.
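
The particle-filter belief approximation at the heart of the method can be sketched compactly: the belief is a weighted set of sampled states, pushed through the transition model and reweighted by the observation likelihood. The toy dynamics and sensor below are stand-ins; AdaOPS additionally adapts the number of particles and fuses similar observation branches.

```python
import numpy as np

rng = np.random.default_rng(0)

def transition(states, action):
    return states + action + rng.normal(0, 0.1, len(states))   # noisy dynamics

def obs_likelihood(obs, states):
    return np.exp(-0.5 * ((obs - states) / 0.2) ** 2)          # Gaussian sensor

def belief_update(particles, weights, action, obs):
    particles = transition(particles, action)
    weights = weights * obs_likelihood(obs, particles)
    weights /= weights.sum()
    # Resample to avoid weight degeneracy.
    idx = rng.choice(len(particles), len(particles), p=weights)
    return particles[idx], np.full(len(particles), 1 / len(particles))

particles = rng.normal(0, 1, 100)                 # sampled states ~ belief
weights = np.full(100, 1 / 100)
particles, weights = belief_update(particles, weights, action=1.0, obs=1.2)
print(particles.mean())                           # posterior state estimate
```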

AAAI Conference 2021 Conference Paper

Addressing Action Oscillations through Learning Policy Inertia

  • Chen Chen
  • Hongyao Tang
  • Jianye Hao
  • Wulong Liu
  • Zhaopeng Meng

Deep reinforcement learning (DRL) algorithms have been demonstrated to be effective in a wide range of challenging decision-making and control tasks. However, these methods typically suffer from severe action oscillations, particularly in discrete action settings: agents select different actions within consecutive steps even though the states differ only slightly. This issue is often neglected since the policy is usually evaluated only by its cumulative rewards. Action oscillation strongly degrades the user experience and can even pose serious safety hazards, especially in real-world domains where safety is the main concern, such as autonomous driving. To this end, we introduce the Policy Inertia Controller (PIC), a generic plug-in framework for off-the-shelf DRL algorithms that enables an adaptive trade-off between the optimality and smoothness of the learned policy in a formal way. We propose Nested Policy Iteration as a general training algorithm for PIC-augmented policies, which ensures monotonically non-decreasing updates under some mild conditions. Further, we derive a practical DRL algorithm, namely Nested Soft Actor-Critic. Experiments on a collection of autonomous driving tasks and several Atari games show that our approach achieves substantial oscillation reduction compared to a range of commonly adopted baselines, with almost no performance degradation.
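
As a toy illustration of the optimality-smoothness trade-off (not the paper's PIC or Nested Policy Iteration), one can bias action selection toward the previous action so that near-tied Q-values no longer cause step-to-step flipping:

```python
from typing import Optional
import numpy as np

def inertial_action(q_values: np.ndarray, prev_action: Optional[int],
                    inertia: float = 0.05) -> int:
    scores = q_values.copy()
    if prev_action is not None:
        scores[prev_action] += inertia   # bias toward repeating the action
    return int(np.argmax(scores))

rng = np.random.default_rng(0)
prev = None
for step in range(5):
    # Two near-tied actions plus noise: a plain argmax would oscillate.
    q = np.array([1.00, 1.01]) + rng.normal(0, 0.02, 2)
    prev = inertial_action(q, prev)
    print(step, prev)
```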

NeurIPS Conference 2021 Conference Paper

An Efficient Transfer Learning Framework for Multiagent Reinforcement Learning

  • Tianpei Yang
  • Weixun Wang
  • Hongyao Tang
  • Jianye Hao
  • Zhaopeng Meng
  • Hangyu Mao
  • Dong Li
  • Wulong Liu

Transfer learning has shown great potential to enhance single-agent Reinforcement Learning (RL) efficiency. Similarly, Multiagent RL (MARL) can also be accelerated if agents can share knowledge with each other. However, how an agent should learn from other agents remains an open problem. In this paper, we propose a novel Multiagent Policy Transfer Framework (MAPTF) to improve MARL efficiency. MAPTF learns which agent's policy is the best to reuse for each agent and when to terminate it by modeling multiagent policy transfer as an option learning problem. Furthermore, in practice the option module can only collect all agents' local experiences for its update due to the partial observability of the environment. In this setting, agents' experiences may be inconsistent with one another, which may cause inaccuracy and oscillation in the option-value estimation. Therefore, we propose a novel option learning algorithm, successor representation option learning, which addresses this by decoupling the environment dynamics from rewards and learning the option-value under each agent's preference. MAPTF can be easily combined with existing deep RL and MARL approaches, and experimental results show that it significantly boosts the performance of existing methods in both discrete and continuous state spaces.

AAAI Conference 2021 Short Paper

Enhancing Context-Based Meta-Reinforcement Learning Algorithms via An Efficient Task Encoder (Student Abstract)

  • Feng Xu
  • Shengyi Jiang
  • Hao Yin
  • Zongzhang Zhang
  • Yang Yu
  • Ming Li
  • Dong Li
  • Wulong Liu

Meta-Reinforcement Learning (meta-RL) algorithms enable agents to adapt to new tasks from small amounts of exploration, based on the experience of similar tasks. Recent studies have pointed out that a good representation of a task is key to the success of off-policy context-based meta-RL. Inspired by contrastive methods in unsupervised representation learning, we propose a new method to learn the task representation based on the mutual information between transition tuples in a trajectory and the task embedding. We also propose a new estimate of task similarity based on the Q-function, which can be used to constrain the distribution of the encoded task variables, making the encoded task variables more effective on new tasks. Experiments on meta-RL tasks show that the newly proposed method outperforms existing meta-RL algorithms.
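
The contrastive objective family referenced above can be sketched with an InfoNCE-style loss: transition embeddings are scored against task embeddings, and each transition is trained to identify its own task. The encoder outputs, shapes, and temperature below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce(transition_emb, task_emb, task_ids, temperature=0.1):
    # transition_emb: [N, d] encoded (s, a, r, s') tuples
    # task_emb:       [T, d] one embedding per task
    # task_ids:       [N]    which task each transition came from
    logits = transition_emb @ task_emb.T / temperature
    return F.cross_entropy(logits, task_ids)

torch.manual_seed(0)
transitions = F.normalize(torch.randn(32, 16), dim=-1)
tasks = F.normalize(torch.randn(4, 16), dim=-1)
labels = torch.randint(0, 4, (32,))
print(info_nce(transitions, tasks, labels))   # lower = better task separation
```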

AAAI Conference 2021 Conference Paper

Foresee then Evaluate: Decomposing Value Estimation with Latent Future Prediction

  • Hongyao Tang
  • Zhaopeng Meng
  • Guangyong Chen
  • Pengfei Chen
  • Chen Chen
  • Yaodong Yang
  • Luo Zhang
  • Wulong Liu

The value function is the central notion of Reinforcement Learning (RL). Value estimation, especially with function approximation, can be challenging since it involves the stochasticity of environmental dynamics and reward signals that can be sparse and delayed in some cases. A typical model-free RL algorithm usually estimates the values of a policy by Temporal Difference (TD) or Monte Carlo (MC) algorithms directly from rewards, without explicitly taking dynamics into consideration. In this paper, we propose Value Decomposition with Future Prediction (VDFP), providing an explicit two-step understanding of the value estimation process: 1) first foresee the latent future, and 2) then evaluate it. We analytically decompose the value function into a latent future dynamics part and a policy-independent trajectory return part, inducing a way to model latent dynamics and returns separately in value estimation. Further, we derive a practical deep RL algorithm consisting of a convolutional model to learn compact trajectory representations from past experiences, a conditional variational auto-encoder to predict the latent future dynamics, and a convex return model that evaluates the trajectory representation. In experiments, we empirically demonstrate the effectiveness of our approach for both off-policy and on-policy RL in several OpenAI Gym continuous control tasks, as well as a few challenging variants with delayed rewards.

AAAI Conference 2021 Short Paper

LB-DESPOT: Efficient Online POMDP Planning Considering Lower Bound in Action Selection (Student Abstract)

  • Chenyang Wu
  • Rui Kong
  • Guoyu Yang
  • Xianghan Kong
  • Zongzhang Zhang
  • Yang Yu
  • Dong Li
  • Wulong Liu

The partially observable Markov decision process (POMDP) is an extension of the MDP. It handles state uncertainty by specifying the probability of receiving a particular observation given the current state. DESPOT is one of the most popular scalable online planning algorithms for POMDPs; it manages to significantly reduce the size of the decision tree while deriving a near-optimal policy by considering only K scenarios. Nevertheless, there is a gap in the action selection criteria between planning and execution in DESPOT. During the planning stage, it keeps choosing the action with the highest upper bound, whereas when planning ends, the action with the highest lower bound is chosen for execution. Here, we propose LB-DESPOT to alleviate this issue, which utilizes the lower bound when selecting an action branch to expand. Empirically, our method attains better performance than DESPOT and POMCP, another state-of-the-art method, on several challenging POMDP benchmark tasks.
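
A toy illustration of the planning/execution gap described above, with made-up bound values: vanilla DESPOT expands by upper bound while acting by lower bound, whereas an LB-DESPOT-style rule lets the lower bound guide expansion as well (the actual selection rule in the paper may differ).

```python
def plan_step(actions, use_lower_bound=False):
    if use_lower_bound:
        # LB-DESPOT-style: let the execution-time criterion guide expansion.
        return max(actions, key=lambda a: a["lower"])
    return max(actions, key=lambda a: a["upper"])      # vanilla DESPOT

def execute(actions):
    return max(actions, key=lambda a: a["lower"])

actions = [
    {"name": "left",  "lower": 0.4, "upper": 0.9},     # optimistic, unproven
    {"name": "right", "lower": 0.6, "upper": 0.7},     # safer, tighter bounds
]
print("expanded (DESPOT):   ", plan_step(actions)["name"])         # left
print("expanded (LB-DESPOT):", plan_step(actions, True)["name"])   # right
print("executed:            ", execute(actions)["name"])           # right
```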

NeurIPS Conference 2021 Conference Paper

Model-Based Reinforcement Learning via Imagination with Derived Memory

  • Yao Mu
  • Yuzheng Zhuang
  • Bin Wang
  • Guangxiang Zhu
  • Wulong Liu
  • Jianyu Chen
  • Ping Luo
  • Shengbo Li

Model-based reinforcement learning aims to improve the sample efficiency of policy learning by modeling the dynamics of the environment. Recently, latent dynamics models have been further developed to enable fast planning in a compact space. Such a model summarizes the high-dimensional experiences of an agent, mimicking the memory function of humans. Learning policies via imagination with the latent model shows great potential for solving complex tasks. However, considering only memories from true experiences during imagination could limit its advantages. Inspired by the memory prosthesis proposed by neuroscientists, we present a novel model-based reinforcement learning framework called Imagining with Derived Memory (IDM). It enables the agent to learn the policy from enriched, diverse imagination with prediction-reliability weights, thus improving sample efficiency and policy robustness. Experiments on various high-dimensional visual control tasks in the DMControl benchmark demonstrate that IDM outperforms previous state-of-the-art methods in terms of policy robustness and further improves the sample efficiency of the model-based method.

IROS Conference 2021 Conference Paper

Reinforcement Learning based Negotiation-aware Motion Planning of Autonomous Vehicles

  • Zhitao Wang
  • Yuzheng Zhuang
  • Qiang Gu
  • Dong Chen
  • Hongbo Zhang
  • Wulong Liu

Autonomous vehicles integrating onto roadways with human traffic participants must understand and adapt to the participants' intentions by responding in predictable ways. This paper proposes a reinforcement learning based negotiation-aware motion planning framework, which uses RL to adjust the driving style of the planner by adaptively modifying the prediction horizon length of the motion planner in real time. The framework models the interaction between the autonomous vehicle and other traffic participants as a Markov Decision Process. A temporal sequence of occupancy grid maps is taken as input to the RL module to embed implicit intention reasoning. Curriculum learning is employed to enhance the training efficiency and the robustness of the algorithm. We applied our method to narrow-lane navigation in both simulation and the real world to demonstrate that the proposed method outperforms the common alternative, owing to its advantage in alleviating the social dilemma problem with proper negotiation skills.

ICRA Conference 2021 Conference Paper

Relational Navigation Learning in Continuous Action Space among Crowds

  • Xueyou Zhang
  • Wei Xi 0002
  • Xian Guo
  • Yongchun Fang
  • Bin Wang 0034
  • Wulong Liu
  • Jianye Hao

In this paper, a novel navigation learning method in continuous action space among crowds, based on a relational graph, is proposed; it can be directly deployed on differential-drive mobile robots without any change. More specifically, in order to increase generalization across crowd sizes, a Graph Convolutional Network (GCN) is first adopted to extract the relationships between the robot and pedestrians. The relation features are then used as the inputs of the pedestrian state prediction network, the actor network, and the critic network. To learn the navigation policy efficiently and safely, all networks are pretrained by imitating ORCA, a state-of-the-art algorithm in crowd navigation, and then a model-based reinforcement learning (RL) method combining model prediction with clipped advantage-weighted regression is proposed to fine-tune the networks. Finally, simulation experiments verify that the proposed learning method performs significantly better than ORCA and other state-of-the-art RL methods.

NeurIPS Conference 2021 Conference Paper

S$^3$: Sign-Sparse-Shift Reparametrization for Effective Training of Low-bit Shift Networks

  • Xinlin Li
  • Bang Liu
  • Yaoliang Yu
  • Wulong Liu
  • Chunjing Xu
  • Vahid Partovi Nia

Shift neural networks reduce computational complexity by removing expensive multiplication operations and quantizing continuous weights into low-bit discrete values; they are fast and energy-efficient compared to conventional neural networks. However, existing shift networks are sensitive to weight initialization and yield degraded performance caused by vanishing gradients and the weight sign freezing problem. To address these issues, we propose S$^3$ re-parameterization, a novel technique for training low-bit shift networks. Our method decomposes a discrete parameter in a sign-sparse-shift 3-fold manner. This way, it efficiently learns a low-bit network with weight dynamics similar to full-precision networks and insensitivity to weight initialization. Our proposed training method pushes the boundaries of shift neural networks and shows that 3-bit shift networks compete with their full-precision counterparts in terms of top-1 accuracy on ImageNet.
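
A hedged sketch of the 3-fold decomposition named above: a discrete shift weight $w \in \{0, \pm 2^p\}$ is re-parameterized as sign $\times$ sparsity-gate $\times$ power-of-two, each derived from its own continuous latent tensor. The thresholds and exponent range are illustrative, and the straight-through gradient estimators are omitted for brevity.

```python
import torch

def s3_weight(latent_sign, latent_sparse, latent_shift):
    # Sign in {-1, +1}, gate in {0, 1} (prune or keep), integer exponent.
    sign = torch.where(latent_sign >= 0,
                       torch.ones_like(latent_sign),
                       -torch.ones_like(latent_sign))
    gate = (latent_sparse >= 0).float()
    shift = torch.round(latent_shift).clamp(0, 3)
    # Resulting values lie in {0, +/-1, +/-2, +/-4, +/-8}:
    # multiplication-free weights usable in a shift layer.
    return sign * gate * torch.pow(2.0, shift)

torch.manual_seed(0)
w = s3_weight(torch.randn(6), torch.randn(6), 2 * torch.rand(6))
print(w)
```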

AAAI Conference 2021 Conference Paper

Towards Effective Context for Meta-Reinforcement Learning: an Approach based on Contrastive Learning

  • Haotian Fu
  • Hongyao Tang
  • Jianye Hao
  • Chen Chen
  • Xidong Feng
  • Dong Li
  • Wulong Liu

Context, the embedding of previously collected trajectories, is a powerful construct for Meta-Reinforcement Learning (Meta-RL) algorithms. By conditioning on an effective context, Meta-RL policies can easily generalize to new tasks within a few adaptation steps. We argue that improving the quality of context involves answering two questions: 1. How to train a compact and sufficient encoder that can embed the task-specific information contained in prior trajectories? 2. How to collect informative trajectories of which the corresponding context reflects the specification of tasks? To this end, we propose a novel Meta-RL framework called CCM (Contrastive learning augmented Context-based Meta-RL). We first focus on the contrastive nature behind different tasks and leverage it to train a compact and sufficient context encoder. Further, we train a separate exploration policy and theoretically derive a new information-gain-based objective which aims to collect informative trajectories in a few steps. Empirically, we evaluate our approaches on common benchmarks as well as several complex sparse-reward environments. The experimental results show that CCM outperforms state-of-the-art algorithms by addressing the aforementioned problems.

IJCAI Conference 2020 Conference Paper

Efficient Deep Reinforcement Learning via Adaptive Policy Transfer

  • Tianpei Yang
  • Jianye Hao
  • Zhaopeng Meng
  • Zongzhang Zhang
  • Yujing Hu
  • Yingfeng Chen
  • Changjie Fan
  • Weixun Wang

Transfer learning has shown great potential to accelerate Reinforcement Learning (RL) by leveraging prior knowledge from past learned policies of relevant tasks. Existing approaches either transfer previous knowledge by explicitly computing similarities between tasks or select appropriate source policies to provide guided exploration. However, a method that directly optimizes the target policy by alternately utilizing knowledge from appropriate source policies, without explicitly measuring similarities, is currently missing. In this paper, we propose a novel Policy Transfer Framework (PTF) that takes advantage of this idea. PTF learns when and which source policy is best to reuse for the target policy and when to terminate it by modeling multi-policy transfer as an option learning problem. PTF can be easily combined with existing DRL methods, and experimental results show that it significantly accelerates RL and surpasses state-of-the-art policy transfer methods in terms of learning efficiency and final performance in both discrete and continuous action spaces.

ICLR Conference 2020 Conference Paper

Multi-Agent Interactions Modeling with Correlated Policies

  • Minghuan Liu
  • Ming Zhou 0006
  • Weinan Zhang 0001
  • Yuzheng Zhuang
  • Jun Wang 0012
  • Wulong Liu
  • Yong Yu 0001

In multi-agent systems, complex interacting behaviors arise from the high correlations among agents. However, previous work on modeling multi-agent interactions from demonstrations is primarily constrained by the assumption that policies and their reward structures are independent. In this paper, we cast the multi-agent interactions modeling problem into a multi-agent imitation learning framework with explicit modeling of correlated policies by approximating opponents' policies, recovering agents' policies that regenerate similar interactions. Consequently, we develop a Decentralized Adversarial Imitation Learning algorithm with Correlated policies (CoDAIL), which allows for decentralized training and execution. Various experiments demonstrate that CoDAIL better regenerates complex interactions close to the demonstrators and outperforms state-of-the-art multi-agent imitation learning methods. Our code is available at https://github.com/apexrl/CoDAIL.

AAAI Conference 2020 Conference Paper

Neighborhood Cognition Consistent Multi-Agent Reinforcement Learning

  • Hangyu Mao
  • Wulong Liu
  • Jianye Hao
  • Jun Luo
  • Dong Li
  • Zhengchao Zhang
  • Jun Wang
  • Zhen Xiao

Social psychology and real experiences show that cognitive consistency plays an important role in keeping human society in order: if people have a more consistent cognition of their environments, they are more likely to achieve better cooperation. Meanwhile, only cognitive consistency within a neighborhood matters, because humans interact directly only with their neighbors. Inspired by these observations, we take the first step toward introducing neighborhood cognitive consistency (NCC) into multi-agent reinforcement learning (MARL). Our NCC design is quite general and can be easily combined with existing MARL methods. As examples, we propose neighborhood cognition consistent deep Q-learning and Actor-Critic to facilitate large-scale multi-agent cooperation. Extensive experiments on several challenging tasks (i.e., packet routing, WiFi configuration, and Google football player control) justify the superior performance of our methods compared with state-of-the-art MARL approaches.

IJCAI Conference 2020 Conference Paper

Triple-GAIL: A Multi-Modal Imitation Learning Framework with Generative Adversarial Nets

  • Cong Fei
  • Bin Wang
  • Yuzheng Zhuang
  • Zongzhang Zhang
  • Jianye Hao
  • Hongbo Zhang
  • Xuewu Ji
  • Wulong Liu

Generative adversarial imitation learning (GAIL) has shown promising results by taking advantage of generative adversarial nets, especially in the field of robot learning. However, the requirement of isolated single-modal demonstrations limits the scalability of the approach to real-world scenarios such as autonomous vehicles' need for a proper understanding of human drivers' behavior. In this paper, we propose a novel multi-modal GAIL framework, named Triple-GAIL, that is able to learn skill selection and imitation jointly from both expert demonstrations and continuously generated experiences, for data augmentation purposes, by introducing an auxiliary selector. We provide theoretical guarantees on convergence to optima for both the generator and the selector. Experiments on real driver trajectories and real-time strategy game datasets demonstrate that Triple-GAIL can better fit multi-modal behaviors close to the demonstrators and outperforms state-of-the-art methods.