Arrow Research search

Author name cluster

Chengdong Ma

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

13 papers
2 author rows

Possible papers

13

ICLR Conference 2025 Conference Paper

Amulet: ReAlignment During Test Time for Personalized Preference Adaptation of LLMs

  • Zhaowei Zhang 0001
  • Fengshuo Bai
  • Qizhi Chen
  • Chengdong Ma
  • Mingzhi Wang
  • Haoran Sun
  • Zilong Zheng
  • Yaodong Yang 0001

How to align large language models (LLMs) with user preferences from a static general dataset has been frequently studied. However, user preferences are usually personalized, changing, and diverse, so in practical use the actual user preferences often do not coincide with those the model developers trained for. Since we cannot collect enough data and retrain for every demand, researching efficient real-time preference adaptation methods that build on the backbone LLMs during test time is important. To this end, we introduce Amulet, a novel, training-free framework that formulates the decoding process of every token as a separate online learning problem guided by simple user-provided prompts, thus enabling real-time optimization to satisfy users' personalized preferences. To reduce the computational cost this optimization adds for each token, we additionally provide a closed-form solution for each iteration step of the optimization process, thereby reducing the computational time cost to a negligible level. Detailed experimental results demonstrate that Amulet achieves significant performance improvements across combinations of different LLMs, datasets, and user preferences, while maintaining acceptable computational efficiency.
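
The closed-form, per-token update is the load-bearing idea here. As a rough illustration only, not the paper's actual derivation, here is a minimal sketch of test-time distribution blending; the geometric-mixture form and the `beta` parameter are assumptions:

```python
import torch

def adapt_token_distribution(base_logits: torch.Tensor,
                             pref_logits: torch.Tensor,
                             beta: float = 0.5) -> torch.Tensor:
    """Blend the backbone's next-token distribution with one conditioned on a
    user-provided preference prompt. A geometric mixture in log space has a
    closed form, so the per-token overhead is essentially one extra softmax."""
    log_p = torch.log_softmax(base_logits, dim=-1)   # backbone LLM pass
    log_q = torch.log_softmax(pref_logits, dim=-1)   # preference-prompted pass
    return torch.softmax((1.0 - beta) * log_p + beta * log_q, dim=-1)
```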

NeurIPS Conference 2025 Conference Paper

EconGym: A Scalable AI Testbed with Diverse Economic Tasks

  • Qirui Mi
  • Qipeng Yang
  • Zijun Fan
  • Wentian Fan
  • Heyang Ma
  • Chengdong Ma
  • Siyu Xia
  • Bo An

Artificial intelligence (AI) has become a powerful tool for economic research, enabling large-scale simulation and policy optimization. However, applying AI effectively requires simulation platforms for scalable training and evaluation—yet existing environments remain limited to simplified, narrowly scoped tasks, falling short of capturing complex economic challenges such as demographic shifts, multi-government coordination, and large-scale agent interactions. To address this gap, we introduce EconGym, a scalable and modular testbed that connects diverse economic tasks with AI algorithms. Grounded in rigorous economic modeling, EconGym implements 11 heterogeneous role types (e.g., households, firms, banks, governments), their interaction mechanisms, and agent models with well-defined observations, actions, and rewards. Users can flexibly compose economic roles with diverse agent algorithms to simulate rich multi-agent trajectories across 25+ economic tasks for AI-driven policy learning and analysis. Experiments show that EconGym supports diverse and cross-domain tasks—such as coordinating fiscal, pension, and monetary policies—and enables benchmarking across AI, economic methods, and hybrids. Results indicate that richer task composition and algorithm diversity expand the policy space, while AI agents guided by classical economic methods perform best in complex settings. EconGym also scales to 100k agents with high realism and efficiency.
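
To make the role/algorithm composition concrete, here is a hypothetical sketch of the pattern the abstract describes; none of these class or method names are EconGym's real API:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Role:
    """One economic role with its own observation spec (illustrative only)."""
    name: str
    observe: Callable[[dict], list]

@dataclass
class Simulation:
    roles: List[Role]
    agents: Dict[str, Callable[[list], list]]  # role name -> policy algorithm

    def step(self, state: dict) -> dict:
        # Each composed (role, algorithm) pair acts on its own observation.
        actions = {r.name: self.agents[r.name](r.observe(state)) for r in self.roles}
        return dict(state, last_actions=actions)  # toy transition
```

Pairing any `Role` with any policy callable is the flexible composition the abstract attributes to the testbed.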

NeurIPS Conference 2025 Conference Paper

Empirical Study on Robustness and Resilience in Cooperative Multi-Agent Reinforcement Learning

  • Simin Li
  • Zihao Mao
  • Hanxiao Li
  • Zonglei Jing
  • Zhuohang Bian
  • Jun Guo
  • Li Wang
  • Zhuoran Han

In cooperative Multi-Agent Reinforcement Learning (MARL), it is common practice to tune hyperparameters in ideal simulated environments to maximize cooperative performance. However, policies tuned for cooperation often fail to maintain robustness and resilience under real-world uncertainties. Building trustworthy MARL systems requires a deep understanding of robustness, which ensures stability under uncertainties, and resilience, the ability to recover from disruptions, a concept extensively studied in control systems but largely overlooked in MARL. In this paper, we present a large-scale empirical study comprising over 82,620 experiments to evaluate cooperation, robustness, and resilience in MARL across 4 real-world environments, 13 uncertainty types, and 15 hyperparameters. Our key findings are: (1) Under mild uncertainty, optimizing cooperation improves robustness and resilience, but this link weakens as perturbations intensify. Robustness and resilience also vary by algorithm and uncertainty type. (2) Robustness and resilience do not generalize across uncertainty modalities or agent scopes: policies robust to action noise for all agents may fail under observation noise on a single agent. (3) Hyperparameter tuning is critical for trustworthy MARL: surprisingly, standard practices like parameter sharing, GAE, and PopArt can hurt robustness, while early stopping, high critic learning rates, and Leaky ReLU consistently help. By optimizing hyperparameters only, we observe substantial improvements in cooperation, robustness, and resilience across all MARL backbones, with the phenomenon also generalizing to robust MARL methods across these backbones.
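
The study's two axes can be operationalized as simple metrics. A sketch under assumed definitions (performance retention while a perturbation is active for robustness; recovery time after a disruption ends for resilience), not the paper's exact measures:

```python
import numpy as np

def robustness(clean_returns, perturbed_returns):
    # Fraction of clean performance retained under an active uncertainty.
    return np.mean(perturbed_returns) / np.mean(clean_returns)

def resilience(returns_after_disruption, baseline, tol=0.95):
    # Episodes needed to recover to `tol` of pre-disruption performance;
    # returns the full window length if the policy never recovers.
    for t, r in enumerate(returns_after_disruption):
        if r >= tol * baseline:
            return t
    return len(returns_after_disruption)
```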

ICML Conference 2025 Conference Paper

Falcon: Fast Visuomotor Policies via Partial Denoising

  • Haojun Chen
  • Minghao Liu
  • Chengdong Ma
  • Xiaojian Ma 0001
  • Zailin Ma
  • Huimin Wu 0001
  • Yuanpei Chen
  • Yifan Zhong

Diffusion policies are widely adopted in complex visuomotor tasks for their ability to capture multimodal action distributions. However, the multiple sampling steps required for action generation significantly harm real-time inference efficiency, which limits their applicability in real-time decision-making scenarios. Existing acceleration techniques either require retraining or degrade performance under low sampling steps. Here we propose Falcon, which mitigates this speed-performance trade-off and achieves further acceleration. The core insight is that visuomotor tasks exhibit sequential dependencies between actions. Falcon leverages this by reusing partially denoised actions from historical information rather than sampling from Gaussian noise at each step. By integrating current observations, Falcon reduces sampling steps while preserving performance. Importantly, Falcon is a training-free algorithm that can be applied as a plug-in to further improve decision efficiency on top of existing acceleration techniques. We validated Falcon in 48 simulated environments and 2 real-world robot experiments, demonstrating a 2-7x speedup with negligible performance degradation, offering a promising direction for efficient visuomotor policy design.
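
A rough sketch of the partial-denoising idea, not the authors' code: re-noise the previous action to an intermediate diffusion level and run only the remaining reverse steps, conditioned on the current observation. The names `denoiser`, `sigma`, and `k_reuse` are assumptions:

```python
import torch

def partial_denoise(denoiser, obs, prev_action, k_reuse: int, sigma):
    """Start from the previous action re-noised to level k_reuse instead of
    pure Gaussian noise, then denoise only k_reuse steps (vs. all K steps)."""
    a = prev_action + sigma(k_reuse) * torch.randn_like(prev_action)
    for k in reversed(range(k_reuse)):
        a = denoiser(a, k, obs)  # one observation-conditioned reverse step
    return a
```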

ECAI Conference 2025 Conference Paper

Fusion-PSRO: Nash Policy Fusion for Policy Space Response Oracles

  • Jiesong Lian
  • Yucong Huang
  • Chengdong Ma
  • Mingzhi Wang
  • Ying Wen 0001
  • Long Hu
  • Yixue Hao

For solving zero-sum games involving non-transitivity, a useful approach is to maintain a policy population to approximate the Nash Equilibrium (NE). Previous studies have shown that the Policy Space Response Oracles (PSRO) algorithm is an effective framework for solving such games. However, current methods initialize a new policy from scratch or inherit a single historical policy for the Best Response (BR), missing the opportunity to leverage past policies to generate a better BR. In this paper, we propose Fusion-PSRO, which employs Nash Policy Fusion to initialize a new policy for BR training. Nash Policy Fusion serves as an implicit guiding policy that starts exploration from the current Meta-NE, thus providing a closer approximation to the BR. Moreover, it captures a weighted moving average of past policies, dynamically adjusting these weights based on the Meta-NE in each iteration. This cumulative process further strengthens the policy population. Empirical results on classic benchmarks show that Fusion-PSRO achieves lower exploitability, thereby addressing the policy-initialization shortcomings of prior PSRO methods.
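
One way to read "Nash Policy Fusion" is a meta-Nash-weighted average over past policies used to initialize the next best response. A hedged parameter-space sketch follows; the paper's fusion may instead operate in action space:

```python
import torch

def nash_policy_fusion(past_state_dicts, meta_nash_weights):
    """Initialize a best-response policy from a weighted moving average of
    past policies, weighted by the current meta-Nash distribution."""
    fused = {}
    for key in past_state_dicts[0]:
        fused[key] = sum(w * sd[key]
                         for w, sd in zip(meta_nash_weights, past_state_dicts))
    return fused  # load via new_policy.load_state_dict(fused)
```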

ECAI Conference 2025 Conference Paper

Learning Macroeconomic Policies Through Dynamic Stackelberg Mean-Field Games

  • Qirui Mi
  • Zhiyu Zhao
  • Chengdong Ma
  • Siyu Xia
  • Yan Song 0003
  • Mengyue Yang
  • Jun Wang 0012
  • Haifeng Zhang 0002

Macroeconomic outcomes emerge from individuals’ decisions, making it essential to model how agents interact with macro policy via consumption, investment, and labor choices. We formulate this as a dynamic Stackelberg game: the government (leader) sets policies, and agents (followers) respond by optimizing their behavior over time. Unlike static models, this dynamic formulation captures temporal dependencies and strategic feedback critical to policy design. However, as the number of agents increases, explicitly simulating all agent–agent and agent–government interactions becomes computationally infeasible. To address this, we propose the Dynamic Stackelberg Mean Field Game (DSMFG) framework, which approximates these complex interactions via agent–population and government–population couplings. This approximation preserves individual-level feedback while ensuring scalability, enabling DSMFG to jointly model three core features of real-world policy-making: dynamic feedback, asymmetry, and large scale. We further introduce Stackelberg Mean Field Reinforcement Learning (SMFRL), a data-driven algorithm that learns the leader’s optimal policies while maintaining personalized responses for individual agents. Empirically, we validate our approach in a large-scale simulated economy, where it scales to 1,000 agents (vs. 100 in prior work) and achieves a 4× GDP gain over classical economic methods and a 19× improvement over the static 2022 U.S. federal income tax policy.
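
The agent–population coupling is what keeps the game tractable: each follower responds to the leader's action and to population statistics rather than to every other agent. A toy sketch with assumed callables and a deliberately trivial transition:

```python
import numpy as np

def dsmfg_rollout(leader_policy, follower_policy, states, horizon):
    """Toy mean-field loop: followers see (own state, leader action,
    population mean) instead of all pairwise interactions."""
    for _ in range(horizon):
        mean_field = states.mean(axis=0)          # population statistic
        tax = leader_policy(mean_field)           # government (leader) move
        actions = follower_policy(states, tax, mean_field)
        states = states + actions                 # toy transition
    return states
```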

ICLR Conference 2025 Conference Paper

Magnetic Preference Optimization: Achieving Last-iterate Convergence for Language Model Alignment

  • Mingzhi Wang
  • Chengdong Ma
  • Qizhi Chen
  • Linjian Meng
  • Yang Han
  • Jiancong Xiao
  • Zhaowei Zhang 0001
  • Jing Huo

Self-play methods have demonstrated remarkable success in enhancing model capabilities across various domains. In the context of Reinforcement Learning from Human Feedback (RLHF), self-play not only boosts Large Language Model (LLM) performance but also overcomes the limitations of traditional Bradley-Terry (BT) model assumptions by finding the Nash equilibrium (NE) of a preference-based, two-player constant-sum game. However, existing methods either guarantee only average-iterate convergence, incurring high storage and inference costs, or converge to the NE of a regularized game, failing to accurately reflect true human preferences. In this paper, we introduce Magnetic Preference Optimization (MPO), a novel approach capable of achieving last-iterate convergence to the NE of the original game, effectively overcoming the limitations of existing methods. Building upon Magnetic Mirror Descent (MMD), MPO attains a linear convergence rate, making it particularly suitable for fine-tuning LLMs. To ensure our algorithm is both theoretically sound and practically viable, we present a simple yet effective implementation that adapts the theoretical insights to the RLHF setting. Empirical results demonstrate that MPO can significantly enhance the performance of LLMs, highlighting the potential of self-play methods in alignment.
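
The Magnetic Mirror Descent update underlying MPO has a standard closed form on the probability simplex. A tabular sketch (hyperparameter names are mine; the paper adapts this beyond the tabular case to RLHF fine-tuning):

```python
import torch

def mmd_update(log_pi_t, log_pi_magnet, q_values, eta=0.1, alpha=1.0):
    """One closed-form magnetic mirror descent step:
    pi_{t+1} ∝ exp((eta*q + log pi_t + eta*alpha*log pi_magnet) / (1 + eta*alpha)).
    Returns the next log-policy; with a suitable magnet, MMD enjoys linear
    last-iterate convergence."""
    logits = (eta * q_values + log_pi_t + eta * alpha * log_pi_magnet)
    return torch.log_softmax(logits / (1.0 + eta * alpha), dim=-1)
```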

AAMAS Conference 2025 Conference Paper

Mean Field Correlated Imitation Learning

  • Zhiyu Zhao
  • Chengdong Ma
  • Qirui Mi
  • Ning Yang
  • Xue Yan
  • Mengyue Yang
  • Haifeng Zhang
  • Jun Wang

Modeling the behaviors of many-agent games is crucial for capturing the dynamics of large-scale complex systems. This is typically achieved by recovering policies from demonstrations within the Mean Field Game Imitation Learning (MFGIL) framework. However, most MFGIL methods assume that demonstrations are collected from Mean Field Nash Equilibrium (MFNE), implying that agents make decisions independently. When directly applied to situations where agents’ decisions are coordinated, such as publicly routed traffic networks, these techniques often fall short. In this paper, we propose the Adaptive Mean Field Correlated Equilibrium (AMFCE), which introduces a generalized assumption that effectively integrates the correlated behaviors common in real-world systems. We prove the existence of AMFCE under mild conditions and theoretically show that MFNE is a special case of AMFCE. Building upon this, we introduce a new Mean Field Correlated Imitation Learning (MFCIL) algorithm, which recovers the expert policy more accurately in scenarios where agents’ decisions are coordinated. We also provide a theoretical upper bound for the error in recovering the expert policy, which is tighter than that of existing methods. Empirical results on real-world traffic flow prediction and large-scale economic simulations demonstrate that MFCIL significantly improves the predictive performance of large populations’ behaviors compared to existing MFGIL baselines. This improvement highlights the potential of MFCIL to model real-world multi-agent systems.
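
The correlated-equilibrium assumption amounts to a shared signal that coordinates otherwise-independent policies (think of a public routing broadcast). A toy sketch with everything assumed, purely to illustrate the contrast with MFNE's independence:

```python
import numpy as np

def correlated_actions(policies, states, rng):
    """Hedged sketch: a shared correlation signal is drawn once and
    conditions every agent's policy, so joint behavior is coordinated
    rather than independent as under MFNE."""
    signal = rng.integers(0, 4)  # public correlation device
    return [pi(s, signal) for pi, s in zip(policies, states)]
```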

NeurIPS Conference 2025 Conference Paper

Social World Model-Augmented Mechanism Design Policy Learning

  • Xiaoyuan Zhang
  • Yizhe Huang
  • Chengdong Ma
  • Zhixun Chen
  • Long Ma
  • Yali Du
  • Song-Chun Zhu
  • Yaodong Yang

Designing adaptive mechanisms to align individual and collective interests remains a central challenge in artificial social intelligence. Existing methods often struggle with modeling heterogeneous agents possessing persistent latent traits (e.g., skills, preferences) and dealing with complex multi-agent system dynamics. These challenges are compounded by the critical need for high sample efficiency due to costly real-world interactions. World Models, by learning to predict environmental dynamics, offer a promising pathway to enhance mechanism design in heterogeneous and complex systems. In this paper, we introduce a novel method named SWM-AP (Social World Model-Augmented Mechanism Design Policy Learning), which learns a social world model hierarchically modeling agents' behavior to enhance mechanism design. Specifically, the social world model infers agents' traits from their interaction trajectories and learns a trait-based model to predict agents' responses to the deployed mechanisms. The mechanism design policy collects extensive training trajectories by interacting with the social world model, while concurrently inferring agents' traits online during real-world interactions to further boost policy learning efficiency. Experiments in diverse settings (tax policy design, team coordination, and facility location) demonstrate that SWM-AP outperforms established model-based and model-free RL baselines in cumulative rewards and sample efficiency.
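
A skeleton of the hierarchy the abstract describes, with all names and architecture choices assumed: a trait encoder infers a persistent latent from trajectories, and a trait-conditioned model predicts agents' responses to a deployed mechanism:

```python
import torch

class SocialWorldModel(torch.nn.Module):
    """Hedged sketch: trait inference + trait-conditioned dynamics."""
    def __init__(self, obs_dim, act_dim, trait_dim, hidden=128):
        super().__init__()
        self.trait_enc = torch.nn.GRU(obs_dim + act_dim, trait_dim, batch_first=True)
        self.dynamics = torch.nn.Sequential(
            torch.nn.Linear(obs_dim + act_dim + trait_dim, hidden),
            torch.nn.ReLU(),
            torch.nn.Linear(hidden, obs_dim),
        )

    def infer_trait(self, trajectory):        # trajectory: (B, T, obs+act)
        _, h = self.trait_enc(trajectory)
        return h[-1]                           # persistent latent trait z

    def predict(self, obs, mechanism_action, z):
        # Predict agents' responses to the deployed mechanism, given z.
        return self.dynamics(torch.cat([obs, mechanism_action, z], dim=-1))
```

The mechanism policy would then gather cheap rollouts from `predict` and refresh `z` online from real interactions.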

AAAI Conference 2025 Conference Paper

Towards Efficient Collaboration via Graph Modeling in Reinforcement Learning

  • Wenzhe Fan
  • Zishun Yu
  • Chengdong Ma
  • Changye Li
  • Yaodong Yang
  • Xinhua Zhang

In multi-agent reinforcement learning, a commonly considered paradigm is centralized training with decentralized execution. However, in this framework, decentralized execution restricts the development of coordinated policies due to the local observation limitation. In this paper, we consider the cooperation among neighboring agents during execution and formulate their interactions as a graph. Thus, we introduce a novel encoder-decoder architecture named Factor-based Multi-Agent Transformer (f-MAT) that utilizes a transformer to enable communication between neighboring agents during both training and execution. By dividing agents into different overlapping groups and representing each group with a factor, f-MAT achieves efficient message passing and parallel action generation through factor-based attention layers. Empirical results in networked systems such as traffic scheduling and power control demonstrate that f-MAT achieves superior performance compared to strong baselines, thereby paving the way for handling complex collaborative problems.
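
A sketch of the factor-based attention idea with assumed shapes: attention runs within each overlapping agent group (factor), and agents belonging to several groups have their messages averaged:

```python
import torch

class FactorAttention(torch.nn.Module):
    """Hedged sketch: attention restricted to each factor (agent group),
    with messages averaged for agents appearing in multiple factors."""
    def __init__(self, dim, groups, heads=4):
        super().__init__()
        self.groups = groups  # list of agent-index lists (may overlap)
        self.attn = torch.nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                      # x: (batch, n_agents, dim)
        out = torch.zeros_like(x)
        count = torch.zeros(x.shape[1], 1, device=x.device)
        for idx in self.groups:
            g = x[:, idx]                      # members of one factor
            msg, _ = self.attn(g, g, g)        # message passing inside the factor
            out[:, idx] = out[:, idx] + msg
            count[idx] += 1
        return out / count.clamp(min=1)        # average over overlapping factors
```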

NeurIPS Conference 2025 Conference Paper

World Models Should Prioritize the Unification of Physical and Social Dynamics

  • Xiaoyuan Zhang
  • Chengdong Ma
  • Yizhe Huang
  • Weidong Huang
  • Siyuan Qi
  • Song-Chun Zhu
  • Xue Feng
  • Yaodong Yang

World models, which explicitly learn environmental dynamics to lay the foundation for planning, reasoning, and decision-making, are rapidly advancing in predicting both physical dynamics and aspects of social behavior, yet predominantly in separate silos. This division results in a systemic failure to model the crucial interplay between physical environments and social constructs, rendering current models fundamentally incapable of adequately addressing the true complexity of real-world systems where physical and social realities are inextricably intertwined. This position paper argues that the systematic, bidirectional unification of physical and social predictive capabilities is the next crucial frontier for world model development. We contend that comprehensive world models must holistically integrate objective physical laws with the subjective, evolving, and context-dependent nature of social dynamics. Such unification is paramount for AI to robustly navigate complex real-world challenges and achieve more generalizable intelligence. This paper substantiates this imperative by analyzing core impediments to integration, proposing foundational guiding principles (ACE Principles), and outlining a conceptual framework alongside a research roadmap towards truly holistic world models.

NeurIPS Conference 2024 Conference Paper

Panacea: Pareto Alignment via Preference Adaptation for LLMs

  • Yifan Zhong
  • Chengdong Ma
  • Xiaoyuan Zhang
  • Ziran Yang
  • Haojun Chen
  • Qingfu Zhang
  • Siyuan Qi
  • Yaodong Yang

Current methods for large language model alignment typically use scalar human preference labels. However, this convention tends to oversimplify the multi-dimensional and heterogeneous nature of human preferences, leading to reduced expressivity and even misalignment. This paper presents Panacea, an innovative approach that reframes alignment as a multi-dimensional preference optimization problem. Panacea trains a single model capable of adapting online and Pareto-optimally to diverse sets of preferences without the need for further tuning. A major challenge here is using a low-dimensional preference vector to guide the model's behavior, despite it being governed by an overwhelmingly large number of parameters. To address this, Panacea is designed to use singular value decomposition (SVD)-based low-rank adaptation, which allows the preference vector to be simply injected online as singular values. Theoretically, we prove that Panacea recovers the entire Pareto front with common loss aggregation methods under mild conditions. Moreover, our experiments demonstrate, for the first time, the feasibility of aligning a single LLM to represent an exponentially vast spectrum of human preferences through various optimization methods. Our work marks a step forward in effectively and efficiently aligning models to diverse and intricate human preferences in a controllable and Pareto-optimal manner.
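
The key mechanism is concrete enough to sketch: an SVD-style low-rank adapter whose last few singular values are set online from the preference vector. All names here are assumptions, not the released implementation:

```python
import torch

class PreferenceSVDAdapter(torch.nn.Module):
    """Hedged sketch: W' = W + U diag([s, pref]) V, where learned singular
    values s are augmented online by the user's preference vector."""
    def __init__(self, weight, rank=8, pref_dim=2, scale=1.0):
        super().__init__()
        out_f, in_f = weight.shape
        self.weight = torch.nn.Parameter(weight, requires_grad=False)  # frozen backbone
        self.U = torch.nn.Parameter(0.01 * torch.randn(out_f, rank + pref_dim))
        self.V = torch.nn.Parameter(0.01 * torch.randn(rank + pref_dim, in_f))
        self.s = torch.nn.Parameter(torch.ones(rank))
        self.register_buffer("pref", torch.zeros(pref_dim))
        self.scale = scale

    def set_preference(self, pref_vec):
        self.pref.copy_(self.scale * pref_vec)  # injected online, no retraining

    def forward(self, x):
        sv = torch.cat([self.s, self.pref])      # learned + preference singular values
        return x @ (self.weight + (self.U * sv) @ self.V).T
```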

IROS Conference 2022 Conference Paper

Scalable Model-based Policy Optimization for Decentralized Networked Systems

  • Yali Du 0001
  • Chengdong Ma
  • Yuchen Liu
  • Runji Lin
  • Hao Dong 0003
  • Jun Wang 0012
  • Yaodong Yang 0001

Reinforcement learning algorithms require a large number of samples; this often limits their real-world applications even on simple tasks. The challenge is more pronounced in multi-agent tasks, as each step of operation is more costly, requiring communication or the shifting of resources. This work aims to improve the data efficiency of multi-agent control through model-based learning. We consider networked systems where agents are cooperative and communicate only locally with their neighbors, and propose the decentralized model-based policy optimization framework (DMPO). In our method, each agent learns a dynamics model to predict future states and broadcasts its predictions via communication, and the policies are then trained on model rollouts. To alleviate the bias of model-generated data, we restrict model usage to generating myopic rollouts, thus reducing the compounding error of model generation. To preserve the independence of policy updates, we introduce an extended value function and theoretically prove that the resulting policy gradient is a close approximation to the true policy gradient. We evaluate our algorithm on several benchmarks for intelligent transportation systems: connected autonomous vehicle control tasks (Flow and CACC) and adaptive traffic signal control (ATSC). Empirical results show that our method achieves superior data efficiency and matches the performance of model-free methods that use true models. The source code of our algorithm and baselines can be found at https://github.com/PKU-MARL/Model-Based-MARL.
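
The myopic-rollout trick is the part worth sketching: each agent's learned model generates only short branches from real states so that compounding model error stays bounded. A minimal sketch with assumed names, not the repository's code:

```python
import torch

def myopic_rollouts(model, policy, real_states, horizon=3):
    """Hedged sketch: branch short model rollouts (small `horizon`) from
    real states; longer rollouts compound one-step prediction error."""
    states, data = real_states, []
    for _ in range(horizon):
        actions = policy(states)
        next_states = model(states, actions)   # learned local dynamics
        data.append((states, actions, next_states))
        states = next_states.detach()          # no gradients through the model chain
    return data
```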