Author name cluster

Wenqi Chen

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

5 papers

2 author rows

ICLR Conference 2025 Conference Paper

DoF: A Diffusion Factorization Framework for Offline Multi-Agent Reinforcement Learning

Chao Li
Ziwei Deng
Chenxing Lin
Wenqi Chen
Yongquan Fu
Weiquan Liu
Chenglu Wen
Cheng Wang 0003

Diffusion models have been widely adopted in image and language generation and are now being applied to reinforcement learning. However, the application of diffusion models in offline cooperative Multi-Agent Reinforcement Learning (MARL) remains limited. Although existing studies explore this direction, they suffer from scalability or poor cooperation issues due to the lack of design principles for diffusion-based MARL. The Individual-Global-Max (IGM) principle is a popular design principle for cooperative MARL. By satisfying this principle, MARL algorithms achieve remarkable performance with good scalability. In this work, we extend the IGM principle to the Individual-Global-identically-Distributed (IGD) principle. This principle stipulates that the generated outcome of a multi-agent diffusion model should be identically distributed as the collective outcomes from multiple individual-agent diffusion models. We propose DoF, a diffusion factorization framework for offline MARL. It uses a noise factorization function to factorize a centralized diffusion model into multiple diffusion models. We theoretically show that the noise factorization functions satisfy the IGD principle. Furthermore, DoF uses a data factorization function to model the complex relationship among data generated by multiple diffusion models. Through extensive experiments, we demonstrate the effectiveness of DoF. The source code is available at [https://github.com/xmu-rl-3dv/DoF](https://github.com/xmu-rl-3dv/DoF).


NeurIPS Conference 2025 Conference Paper

Generative RLHF-V: Learning Principles from Multi-modal Human Preference

Jiayi Zhou
Jiaming Ji
Boyuan Chen
Jiapeng Sun
wenqi chen
Donghai Hong
Sirui Han
Yike Guo

Training multi-modal large language models (MLLMs) that align with human intentions is a long-term challenge. Traditional score-only reward models for alignment suffer from low accuracy, weak generalization, and poor interpretability, blocking the progress of alignment methods, e.g., reinforcement learning from human feedback (RLHF). Generative reward models (GRMs) leverage MLLMs' intrinsic reasoning capabilities to discriminate pair-wise responses, but their pair-wise paradigm makes it hard to generalize to learnable rewards. We introduce Generative RLHF-V, a novel alignment framework that integrates GRMs with multi-modal RLHF. We propose a two-stage pipeline: multi-modal generative reward modeling from RL, where RL guides GRMs to actively capture human intention and then predict the correct pair-wise scores; and RL optimization from grouped comparison, which enhances multi-modal RL scoring precision through grouped response comparison. Experimental results demonstrate that, besides out-of-distribution generalization of RM discrimination, our framework improves 4 MLLMs' performance across 7 benchmarks by 18.1%, whereas baseline RLHF achieves only 5.3%. We further validate that Generative RLHF-V achieves a near-linear improvement with an increasing number of candidate responses.


ICML Conference 2025 Conference Paper

HyperIV: Real-time Implied Volatility Smoothing

Yongxin Yang
Wenqi Chen
Chao Shu
Timothy M. Hospedales

We propose HyperIV, a novel approach for real-time implied volatility smoothing that eliminates the need for traditional calibration procedures. Our method employs a hypernetwork to generate parameters for a compact neural network that constructs complete volatility surfaces within 2 milliseconds, using only 9 market observations. Moreover, the generated surfaces are guaranteed to be free of static arbitrage. Extensive experiments across 8 index options demonstrate that HyperIV achieves superior accuracy compared to existing methods while maintaining computational efficiency. The model also exhibits strong cross-asset generalization capabilities, indicating broader applicability across different market instruments. These key features – rapid adaptation to market conditions, guaranteed absence of arbitrage, and minimal data requirements – make HyperIV particularly valuable for real-time trading applications. We make code available at https://github.com/qmfin/hyperiv.


NeurIPS Conference 2025 Conference Paper

InterMT: Multi-Turn Interleaved Preference Alignment with Human Feedback

Boyuan Chen
Donghai Hong
Jiaming Ji
Jiacheng Zheng
Bowen Dong
Jiayi Zhou
Kaile Wang
Juntao Dai

As multimodal large models (MLLMs) continue to advance across challenging tasks, a key question emerges: What essential capabilities are still missing? A critical aspect of human learning is continuous interaction with the environment, which is not limited to language but also involves multimodal understanding and generation. To move closer to human-level intelligence, models must similarly support multi-turn, multimodal interaction. In particular, they should comprehend interleaved multimodal contexts and respond coherently in ongoing exchanges. In this work, we present an initial exploration through InterMT, the first preference dataset for multi-turn multimodal interaction, grounded in real human feedback. In this exploration, we particularly emphasize the importance of human oversight, introducing expert annotations to guide the process, motivated by the fact that current MLLMs lack such complex interactive capabilities. InterMT captures human preferences at both global and local levels across nine sub-dimensions and consists of 15.6k prompts, 52.6k multi-turn dialogue instances, and 32.4k human-labeled preference pairs. To compensate for the lack of capability for multi-modal understanding and generation, we introduce an agentic workflow that leverages tool-augmented MLLMs to construct multi-turn QA instances. To further this goal, we introduce InterMT-Bench to assess the ability of MLLMs in assisting judges with multi-turn, multimodal tasks. We demonstrate the utility of InterMT through applications such as judge moderation and further reveal the multi-turn scaling law of the judge model. We hope that open-sourcing our data can help facilitate further research on aligning current MLLMs to the next step.


ICML Conference 2023 Conference Paper

Quantile Credit Assignment

Thomas Mesnard
Wenqi Chen
Alaa Saade
Yunhao Tang
Mark Rowland 0001
Theophane Weber
Clare Lyle
Audrunas Gruslys

In reinforcement learning, the credit assignment problem is to distinguish luck from skill, that is, separate the inherent randomness in the environment from the controllable effects of the agent’s actions. This paper proposes two novel algorithms, Quantile Credit Assignment (QCA) and Hindsight QCA (HQCA), which incorporate distributional value estimation to perform credit assignment. QCA uses a network that predicts the quantiles of the return distribution, whereas HQCA additionally incorporates information about the future. Both QCA and HQCA have the appealing interpretation of leveraging an estimate of the quantile level of the return (interpreted as the level of "luck") in order to derive a "luck-dependent" baseline for policy gradient methods. We show theoretically that this approach gives an unbiased policy gradient estimate that can yield significant variance reductions over a standard value estimate baseline. QCA and HQCA significantly outperform prior state-of-the-art methods on a range of extremely difficult credit assignment problems.
