Arrow Research Search

Author name cluster

Yi-Chen Li

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

6 papers
1 author row

Possible papers

6

AAAI Conference 2026 · Conference Paper

Multi-agent In-context Coordination via Decentralized Memory Retrieval

  • Tao Jiang
  • Zichuan Lin
  • Lihe Li
  • Yi-Chen Li
  • Cong Guan
  • Lei Yuan
  • Zongzhang Zhang
  • Yang Yu

Large transformer models, trained on diverse datasets, have demonstrated impressive few-shot performance on previously unseen tasks without requiring parameter updates. This capability has also been explored in Reinforcement Learning (RL), where agents interact with the environment to retrieve context and maximize cumulative rewards, showcasing strong adaptability in complex settings. However, in cooperative Multi-Agent Reinforcement Learning (MARL), where agents must coordinate toward a shared goal, decentralized policy deployment can lead to mismatches in task alignment and reward assignment, limiting the efficiency of policy adaptation. To address this challenge, we introduce Multi-agent In-context Coordination via Decentralized Memory Retrieval (MAICC), a novel approach designed to enhance coordination through fast adaptation. Our method involves training a centralized embedding model to capture fine-grained trajectory representations, followed by decentralized models that approximate the centralized one to obtain team-level task information. Based on the learned embeddings, relevant trajectories are retrieved as context, which, combined with the agents' current sub-trajectories, inform decision-making. During decentralized execution, we introduce a novel memory mechanism that effectively balances test-time online data with offline memory. Based on the constructed memory, we propose a hybrid utility score that incorporates both individual- and team-level returns, ensuring credit assignment across agents. Extensive experiments on cooperative MARL benchmarks, including Level-Based Foraging (LBF) and SMAC (v1/v2), show that MAICC enables faster adaptation to unseen tasks compared to existing methods.
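As a concrete, purely illustrative reading of the retrieval step, the sketch below scores stored trajectories by embedding similarity and a hybrid of individual- and team-level returns. The mixing weight `alpha`, the scoring rule, and all function names are assumptions for illustration, not the paper's implementation:

```python
# Hypothetical sketch of memory retrieval with a hybrid utility score,
# loosely following the MAICC abstract. The scoring rule and all names
# (alpha, top_k) are assumptions, not the paper's exact formulation.
import numpy as np

def retrieve_context(query_emb, memory_embs, ind_returns, team_returns,
                     top_k=5, alpha=0.5):
    """Return indices of the top_k memory trajectories for one agent.

    query_emb:    (d,) embedding of the agent's current sub-trajectory
    memory_embs:  (N, d) embeddings of stored trajectories
    ind_returns:  (N,) individual-level returns of stored trajectories
    team_returns: (N,) team-level returns of stored trajectories
    """
    # Cosine similarity between the query and every stored trajectory.
    sims = memory_embs @ query_emb / (
        np.linalg.norm(memory_embs, axis=1) * np.linalg.norm(query_emb) + 1e-8)

    # Hybrid utility: trade off individual and team returns so that
    # credit assignment is reflected in which memories get retrieved.
    utility = alpha * team_returns + (1.0 - alpha) * ind_returns

    # Rank by similarity-weighted utility and keep the best top_k.
    scores = sims * utility
    return np.argsort(scores)[-top_k:][::-1]

# Toy usage with random data.
rng = np.random.default_rng(0)
memory = rng.normal(size=(100, 16))
idx = retrieve_context(rng.normal(size=16), memory,
                       rng.uniform(size=100), rng.uniform(size=100))
print("retrieved trajectory indices:", idx)
```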

NeurIPS Conference 2025 · Conference Paper

Multi-Agent Imitation by Learning and Sampling from Factorized Soft Q-Function

  • Yi-Chen Li
  • Zhongxiang Ling
  • Tao Jiang
  • Fuxiang Zhang
  • Pengyuan Wang
  • Lei Yuan
  • Zongzhang Zhang
  • Yang Yu

Learning from multi-agent expert demonstrations, known as Multi-Agent Imitation Learning (MAIL), provides a promising approach to sequential decision-making. However, existing MAIL methods including Behavior Cloning (BC) and Adversarial Imitation Learning (AIL) face significant challenges: BC suffers from the compounding error issue, while the very nature of adversarial optimization makes AIL prone to instability. In this work, we propose Multi-Agent imitation by learning and sampling from FactorIzed Soft Q-function (MAFIS), a novel method that addresses these limitations for both online and offline MAIL settings. Built upon the single-agent IQ-Learn framework, MAFIS introduces a value decomposition network to factorize the imitation objective at the agent level, thus enabling scalable training for multi-agent systems. Moreover, we observe that the soft Q-function implicitly defines the optimal policy as an energy-based model, from which we can sample actions via stochastic gradient Langevin dynamics. This allows us to estimate the gradient of the factorized optimization objective for continuous control tasks, avoiding the adversarial optimization between the soft Q-function and the policy required by prior work. By doing so, we obtain a tractable and non-adversarial objective for both discrete and continuous multi-agent control. Experiments on common benchmarks including the discrete control tasks StarCraft Multi-Agent Challenge v2 (SMACv2), Gold Miner, and Multi Particle Environments (MPE), as well as the continuous control task Multi-Agent MuJoCo (MaMuJoCo), demonstrate that MAFIS achieves superior performance compared with baselines. Our code is available at https://github.com/LAMDA-RL/MAFIS.
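The abstract's sampling idea, treating the soft Q-function as an energy model and drawing actions with stochastic gradient Langevin dynamics, can be sketched compactly. The network, step size, and iteration count below are illustrative assumptions rather than MAFIS's actual code:

```python
# Minimal sketch of sampling actions from an energy-based policy
# pi(a|s) ∝ exp(Q(s, a)) via stochastic gradient Langevin dynamics (SGLD),
# as described at a high level in the abstract. The network architecture,
# step size, and number of steps are illustrative assumptions.
import torch
import torch.nn as nn

class SoftQ(nn.Module):
    """Tiny stand-in for a (factorized) per-agent soft Q-function."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

def sgld_sample(q_net, obs, act_dim, n_steps=30, step_size=0.05):
    """Draw an action by running Langevin dynamics on the energy -Q(s, a)."""
    act = torch.randn(obs.shape[0], act_dim)  # start from noise
    for _ in range(n_steps):
        act = act.detach().requires_grad_(True)
        q = q_net(obs, act).sum()
        (grad,) = torch.autograd.grad(q, act)
        # Langevin update: ascend Q plus Gaussian exploration noise.
        act = act + 0.5 * step_size * grad \
              + (step_size ** 0.5) * torch.randn_like(act)
    return act.detach().clamp(-1.0, 1.0)  # assume bounded continuous actions

q = SoftQ(obs_dim=8, act_dim=2)
actions = sgld_sample(q, torch.randn(4, 8), act_dim=2)
print(actions.shape)  # torch.Size([4, 2])
```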

IJCAI Conference 2024 · Conference Paper

Continual Multi-Objective Reinforcement Learning via Reward Model Rehearsal

  • Lihe Li
  • Ruotong Chen
  • Ziqian Zhang
  • Zhichao Wu
  • Yi-Chen Li
  • Cong Guan
  • Yang Yu
  • Lei Yuan

Multi-objective reinforcement learning (MORL) approaches address real-world problems with multiple objectives by learning policies maximizing returns weighted by different user preferences. Typical methods assume the objectives remain unchanged throughout the agent's lifetime. However, in some real-world situations, the agent may encounter dynamically changing learning objectives, i.e., different vector-valued reward functions at different learning stages. This issue has not been considered in problem formulation or algorithm design. To address this issue, we formalize the setting as a continual MORL (CMORL) problem for the first time, accounting for the evolution of objectives throughout the learning process. Subsequently, we propose Continual Multi-Objective Reinforcement Learning via Reward Model Rehearsal (CORe3), incorporating a dynamic agent network for rapid adaptation to new objectives. Moreover, we develop a reward model rehearsal technique to recover the reward signals for previous objectives, thus alleviating catastrophic forgetting. Experiments on four CMORL benchmarks showcase that CORe3 effectively learns policies satisfying different preferences on all encountered objectives, and outperforms the best baseline by 171%, highlighting the capability of CORe3 to handle situations with evolving objectives.
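One plausible, unofficial reading of reward model rehearsal is to keep a small learned reward model per retired objective and use it to relabel replayed transitions so that old objectives keep producing training signal. All names and architecture choices below are assumptions for illustration:

```python
# Hypothetical sketch of "reward model rehearsal": keep a small learned
# reward model for each past objective and use it to relabel replayed
# transitions after the true reward signals are gone. Architecture and
# relabeling scheme are assumptions, not the paper's exact design.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Predicts a vector-valued reward for one (now inactive) objective."""
    def __init__(self, obs_dim, act_dim, reward_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, reward_dim))

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))

def rehearse_rewards(reward_models, obs, act):
    """Relabel a batch of transitions with every stored objective's reward."""
    with torch.no_grad():
        return {name: m(obs, act) for name, m in reward_models.items()}

# Toy usage: two past objectives, each with a 3-dimensional reward vector.
models = {"objective_0": RewardModel(8, 2, 3),
          "objective_1": RewardModel(8, 2, 3)}
fake_obs, fake_act = torch.randn(32, 8), torch.randn(32, 2)
relabeled = rehearse_rewards(models, fake_obs, fake_act)
print({k: v.shape for k, v in relabeled.items()})
```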

AAMAS Conference 2024 · Conference Paper

Cost-aware Offline Safe Meta Reinforcement Learning with Robust In-Distribution Online Task Adaptation

  • Cong Guan
  • Ruiqi Xue
  • Ziqian Zhang
  • Lihe Li
  • Yi-Chen Li
  • Lei Yuan
  • Yang Yu

Despite the prominence gained by reinforcement learning (RL) in various domains, ensuring safety in real-world applications remains a significant challenge. Offline safe RL, which learns safe policies from pre-collected data, has emerged to address these concerns. However, existing approaches assume a single constraint mode and lack adaptability to diverse safety constraints. In real-world scenarios, we often find ourselves working with datasets gathered from various tasks, with the aim of developing a generalized policy capable of handling unknown tasks during testing. To deal with this offline safe meta-RL problem, we introduce a novel framework called COSTA, which is designed to facilitate the learning of a safe generalized policy that can adapt and be transferred to unknown tasks during testing. COSTA addresses two key challenges in offline safe meta-RL: First, it develops a cost-aware task inference module using contrastive learning to distinguish tasks based on safety constraints, mitigating the MDP ambiguity problem. Second, COSTA introduces a novel metric, the Safe In-Distribution Score (SIDS), to assess the in-distribution degree of trajectories while accounting for both reward maximization and cost constraint satisfaction. Trajectories collected with a safe exploration policy are filtered using SIDS for robust online task adaptation. Experimental results in a tailored benchmark suite within the MuJoCo environments demonstrate that COSTA consistently balances safety and reward maximization, outperforming multiple baselines.
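The abstract does not give the SIDS formula, but its role, ranking trajectories by reward while penalizing cost-budget violations and keeping only the top fraction, can be sketched as follows; the scoring rule and the `lam` weight are assumptions, not the paper's metric:

```python
# Illustrative sketch of SIDS-style trajectory filtering: score each
# candidate trajectory by combining an estimated return with a penalty
# for violating a cost budget, then keep only high-scoring trajectories
# for online adaptation. This scoring rule is an assumption.
import numpy as np

def filter_trajectories(returns, costs, cost_budget, keep_frac=0.5, lam=1.0):
    """Keep the top fraction of trajectories under a reward/cost trade-off.

    returns:     (N,) estimated returns of candidate trajectories
    costs:       (N,) accumulated safety costs of the same trajectories
    cost_budget: scalar constraint that costs should stay below
    """
    # Penalize only the part of the cost that exceeds the budget.
    violation = np.maximum(costs - cost_budget, 0.0)
    scores = returns - lam * violation

    k = max(1, int(keep_frac * len(scores)))
    return np.argsort(scores)[-k:][::-1]

rng = np.random.default_rng(1)
kept = filter_trajectories(rng.normal(size=20), rng.uniform(0, 2, size=20),
                           cost_budget=1.0)
print("kept trajectory indices:", kept)
```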

AAMAS Conference 2024 · Conference Paper

Disentangling Policy from Offline Task Representation Learning via Adversarial Data Augmentation

  • Chengxing Jia
  • Fuxiang Zhang
  • Yi-Chen Li
  • Chen-Xiao Gao
  • Xu-Hui Liu
  • Lei Yuan
  • Zongzhang Zhang
  • Yang Yu

Offline meta-reinforcement learning (OMRL) allows an agent to tackle novel tasks while relying solely on a static dataset. For precise and efficient task identification, existing OMRL research suggests learning separate task representations that can be incorporated into the policy input, thus forming a context-based meta-policy. A major approach to training task representations is to adopt contrastive learning using multi-task offline data. The dataset typically encompasses interactions from various policies (i.e., the behavior policies), thus providing a plethora of contextual information regarding different tasks. Nonetheless, amassing data from a substantial number of policies is not only impractical but also often unattainable in realistic settings. Instead, we resort to a more constrained yet practical scenario, where multi-task data collection occurs with a limited number of policies. We observed that learned task representations from previous OMRL methods tend to correlate spuriously with the behavior policy instead of reflecting the essential characteristics of the task, resulting in unfavorable out-of-distribution generalization. To alleviate this issue, we introduce a novel algorithm to disentangle the impact of the behavior policy from task representation learning through a process called adversarial data augmentation. Specifically, the objective of adversarial data augmentation is not merely to generate data analogous to the offline data distribution; instead, it aims to create adversarial examples designed to confound learned task representations and lead to incorrect task identification. Our experiments show that learning from such adversarial samples significantly enhances the robustness and effectiveness of the task identification process and realizes satisfactory out-of-distribution generalization. The results in MuJoCo locomotion tasks demonstrate that our approach surpasses other OMRL baselines across various meta-learning task sets.
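A minimal sketch of the adversarial data augmentation idea, assuming a classification-style task encoder: perturb offline transitions by gradient ascent on the task-identification loss so that the perturbed samples confound the learned representation. The encoder, loss, and step sizes are illustrative, not the paper's algorithm:

```python
# Minimal sketch of adversarial data augmentation for task representation
# learning: perturb offline transitions by gradient *ascent* on the task
# encoder's identification loss, producing samples that confound the
# learned task representation. All components here are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TaskEncoder(nn.Module):
    """Maps a transition to logits over task identities."""
    def __init__(self, in_dim, n_tasks, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_tasks))

    def forward(self, x):
        return self.net(x)

def adversarial_augment(encoder, transitions, task_ids, n_steps=5, eps=0.1):
    """Create perturbed transitions that maximize task-identification loss."""
    x = transitions.clone()
    for _ in range(n_steps):
        x = x.detach().requires_grad_(True)
        loss = F.cross_entropy(encoder(x), task_ids)
        (grad,) = torch.autograd.grad(loss, x)
        # Ascend the loss, but stay close to the offline data distribution.
        x = x + eps * grad.sign()
        x = transitions + (x - transitions).clamp(-3 * eps, 3 * eps)
    return x.detach()

enc = TaskEncoder(in_dim=10, n_tasks=4)
adv = adversarial_augment(enc, torch.randn(16, 10),
                          torch.randint(0, 4, (16,)))
print(adv.shape)  # torch.Size([16, 10])
```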

AAAI Conference 2023 · Short Paper

Learning Generalizable Batch Active Learning Strategies via Deep Q-networks (Student Abstract)

  • Yi-Chen Li
  • Wen-Jie Shen
  • Boyu Zhang
  • Feng Mao
  • Zongzhang Zhang
  • Yang Yu

To handle a large amount of unlabeled data, batch active learning (BAL) queries humans for the labels of a batch of the most valuable data points at every round. Most current BAL strategies are based on human-designed heuristics, such as uncertainty sampling or mutual information maximization. However, there exists a disagreement between these heuristics and the ultimate goal of BAL, i.e., optimizing the model's final performance within the query budgets. This disagreement leads to a limited generality of these heuristics. To this end, we formulate BAL as an MDP and propose a data-driven approach based on deep reinforcement learning. Our method learns the BAL strategy by maximizing the model's final performance. Experiments on the UCI benchmark show that our method can achieve competitive performance compared to existing heuristics-based approaches.
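A hedged sketch of the BAL-as-MDP idea: a Q-network scores unlabeled points, the greedy action queries the top-B batch, and the reward (not computed here) would be the resulting gain in validation accuracy. Feature construction and reward shaping are assumptions, not the student abstract's exact design:

```python
# Hypothetical sketch of DQN-style batch active learning: a Q-network
# scores each unlabeled point's features, the top-B points form the
# queried batch, and the reward is the resulting gain in validation
# accuracy. Feature construction and reward shaping are assumptions.
import torch
import torch.nn as nn

class QueryQNet(nn.Module):
    """Scores candidate points; higher Q means more worth labeling."""
    def __init__(self, feat_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, feats):
        return self.net(feats).squeeze(-1)

def select_batch(q_net, pool_feats, batch_size):
    """Greedy action: pick the batch_size points with the highest Q-values."""
    with torch.no_grad():
        q_values = q_net(pool_feats)
    return torch.topk(q_values, batch_size).indices

# One round of the (simplified) BAL-as-MDP loop with fake data.
q_net = QueryQNet(feat_dim=12)
pool = torch.randn(500, 12)                # features of unlabeled points
batch = select_batch(q_net, pool, batch_size=16)
# In training, the reward would be: val_acc_after_update - val_acc_before,
# and q_net would be updated with a standard DQN loss on that reward.
print("queried indices:", batch.tolist()[:5], "...")
```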