Arrow Research

Author name cluster

Dawei Feng

Papers possibly associated with this exact author name in Arrow. This page groups case-insensitive exact-name matches and is not a full identity-disambiguation profile.

9 papers
2 author rows

Possible papers (9)

AAAI 2025 · Conference Paper

Enhancing Decision-Making for LLM Agents via Step-Level Q-Value Models

  • Yuanzhao Zhai
  • Tingkai Yang
  • Kele Xu
  • Dawei Feng
  • Cheng Yang
  • Bo Ding
  • Huaimin Wang

Agents significantly enhance the capabilities of standalone Large Language Models (LLMs) by perceiving environments, making decisions, and executing actions. However, LLM agents still face challenges in tasks that require multiple decision-making steps. Estimating the value of actions in specific tasks is difficult when intermediate actions are neither appropriately rewarded nor penalized. In this paper, we propose leveraging a task-relevant Q-value model to guide action selection. Specifically, we first collect decision-making trajectories annotated with step-level Q values via Monte Carlo Tree Search (MCTS) and construct preference data. We then use another LLM to fit these preferences through step-level Direct Preference Optimization (DPO), which serves as the Q-value model. During inference, at each decision-making step, LLM agents select the action with the highest Q value before interacting with the environment. We apply our method to various open-source and API-based LLM agents, demonstrating that Q-value models significantly improve their performance. Notably, the performance of the agent built with Phi-3-mini-4k-instruct improved by 103% on WebShop and 75% on HotPotQA when enhanced with Q-value models, even surpassing GPT-4o-mini. Additionally, Q-value models offer several advantages, such as generalization to different LLM agents and seamless integration with existing prompting strategies.
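
The inference-time selection rule described in the abstract is simple enough to sketch. Below is a minimal, hypothetical illustration: `propose_actions` and `q_score` are stand-ins for the agent LLM and the learned step-level Q-value model, not the authors' API.

```python
# Hypothetical sketch of Q-value-guided action selection at inference time.
# `propose_actions` and `q_score` are assumed stand-ins, not the authors' code.
from typing import Callable, List

def select_action(
    history: str,
    propose_actions: Callable[[str, int], List[str]],
    q_score: Callable[[str, str], float],
    n_candidates: int = 5,
) -> str:
    """Pick the candidate action with the highest step-level Q value."""
    candidates = propose_actions(history, n_candidates)
    # Score each (history, action) pair with the Q-value model, keep the argmax.
    return max(candidates, key=lambda a: q_score(history, a))

# Toy usage with stand-in callables:
if __name__ == "__main__":
    propose = lambda h, n: [f"action_{i}" for i in range(n)]
    score = lambda h, a: float(a.endswith("3"))  # dummy scorer
    print(select_action("obs: start", propose, score))  # -> action_3
```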

ICML 2025 · Conference Paper

Improving the Continuity of Goal-Achievement Ability via Policy Self-Regularization for Goal-Conditioned Reinforcement Learning

  • Xudong Gong
  • Sen Yang 0003
  • Dawei Feng
  • Kele Xu
  • Bo Ding 0001
  • Huaimin Wang 0001
  • Yong Dou

This paper addresses the challenge of discontinuity in goal-achievement capabilities observed in Goal-conditioned Reinforcement Learning (GCRL) algorithms. Through a theoretical analysis, we identify that reusing successful trajectories or policies during training can aid in achieving goals adjacent to already-achievable ones. However, the policy discrepancy between achievable and adjacent goals must be carefully managed: both overly trivial and excessively large differences can hinder policy performance. To tackle this issue, we propose a margin-based policy self-regularization approach that optimizes the policy discrepancy between adjacent desired goals toward a minimal acceptable threshold. This method can be integrated into popular GCRL algorithms, such as GC-SAC, HER, and GC-PPO. Systematic evaluations across two robotic arm control tasks and a complex fixed-wing aircraft control task demonstrate that our approach significantly improves the continuity of goal-achievement abilities of GCRL algorithms, thereby enhancing their overall performance.
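
The margin mechanism lends itself to a compact sketch. Below is one plausible reading in PyTorch, not the paper's exact loss: the KL divergence between action distributions conditioned on a goal and on an adjacent goal is pushed into an assumed band [eps, margin], penalizing both trivial and excessive discrepancies.

```python
# Hedged sketch of margin-based policy self-regularization (an interpretation
# of the abstract, not the paper's exact loss). `eps` and `margin` are assumed
# hyperparameters bounding the acceptable policy discrepancy.
import torch
import torch.nn.functional as F

def margin_self_regularization(logits_goal: torch.Tensor,
                               logits_adjacent: torch.Tensor,
                               eps: float = 0.01,
                               margin: float = 0.5) -> torch.Tensor:
    log_p = F.log_softmax(logits_goal, dim=-1)      # policy at goal g
    log_q = F.log_softmax(logits_adjacent, dim=-1)  # policy at adjacent goal g'
    kl = F.kl_div(log_q, log_p.exp(), reduction="batchmean")  # KL(p || q)
    # Penalize discrepancies above the margin or below the trivial threshold.
    return F.relu(kl - margin) + F.relu(eps - kl)
```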

ICRA 2025 · Conference Paper

V-Pilot: A Velocity Vector Control Agent for Fixed-Wing UAVs from Imperfect Demonstrations

  • Xudong Gong
  • Dawei Feng
  • Kele Xu
  • Xing Zhou 0004
  • Si Zheng
  • Bo Ding 0001
  • Huaimin Wang 0001

This paper addresses the challenge of Velocity Vector Control (VVC) for fixed-wing UAVs using Reinforcement Learning (RL) in the presence of imperfect demonstrations. The multi-objective and long-horizon nature of VVC introduces significant spatial and temporal complexities, complicating RL exploration. While demonstration-based RL methods can help mitigate exploration challenges, their effectiveness is often limited by the quality of the provided demonstrations. To tackle this, we propose V-Pilot, a novel approach that integrates: (1) a controller equipped with a control law model to reduce action oscillation, thus alleviating temporal exploration issues, and (2) a VVC-specific training workflow for iterative policy refinement and demonstration quality improvement. This framework is designed to enhance the performance of demonstration-based RL under imperfect demonstrations. We evaluate V-Pilot in the fixed-wing UAV RL environment VVC-Gym. Experimental results demonstrate that V-Pilot outperforms PID control and Behavioral Cloning across multiple performance metrics.
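
The abstract does not specify the control law model, so the following is a deliberately simple stand-in for the idea of inserting a damping layer between the RL policy and the actuators: a rate limiter that bounds per-step change in the commanded values to suppress oscillation.

```python
# Stand-in for a control-law layer between the RL policy and the actuators.
# A rate limiter is a placeholder illustration, not V-Pilot's actual model.
import numpy as np

class RateLimitedController:
    """Limits per-step change of commanded values to suppress oscillation."""

    def __init__(self, max_delta: float):
        self.max_delta = max_delta
        self._last = None

    def __call__(self, command: np.ndarray) -> np.ndarray:
        if self._last is None:
            self._last = command
            return command
        delta = np.clip(command - self._last, -self.max_delta, self.max_delta)
        self._last = self._last + delta
        return self._last
```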

ICLR 2025 · Conference Paper

VVC-Gym: A Fixed-Wing UAV Reinforcement Learning Environment for Multi-Goal Long-Horizon Problems

  • Xudong Gong
  • Dawei Feng
  • Kele Xu
  • Weijia Wang
  • Zhangjun Sun
  • Xing Zhou 0004
  • Si Zheng
  • Bo Ding 0001

Multi-goal long-horizon problems are prevalent in real-world applications. The additional goal space introduced by multi-goal problems intensifies the spatial complexity of exploration; meanwhile, the long interaction sequences in long-horizon problems exacerbate its temporal complexity. Addressing the substantial exploration challenge posed by multi-goal long-horizon problems depends not only on the design of algorithms but also on the design of environments and the availability of demonstrations to assist training. To facilitate such research, we propose a multi-goal long-horizon Reinforcement Learning (RL) environment based on realistic fixed-wing UAV velocity vector control, named VVC-Gym, and generate multiple demonstration sets of varying quality. Through experimentation, we analyze the impact of different environment designs on training, assess how the quantity and quality of demonstrations influence training, and evaluate the effectiveness of various RL algorithms, providing baselines on VVC-Gym and its demonstrations. The results suggest that VVC-Gym is suitable for studying: (1) the influence of environment design on addressing multi-goal long-horizon problems with RL; (2) the assistance that demonstrations can provide in overcoming the exploration challenges of such problems; and (3) RL algorithm designs whose training efficiency and effectiveness are minimally affected by environment design.
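
For readers wanting a feel for the environment, the following is a hedged Gymnasium-style interaction sketch. The package name, environment id, and observation/action spaces are assumptions; consult the paper's released code for the real API.

```python
# Hedged usage sketch; "VVCGym-v0" is a hypothetical environment id.
import gymnasium as gym

env = gym.make("VVCGym-v0")  # assumed id, not confirmed by the abstract
obs, info = env.reset(seed=0)
done = False
while not done:
    action = env.action_space.sample()  # random policy as a placeholder
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
env.close()
```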

NeurIPS 2024 · Conference Paper

Goal-Conditioned On-Policy Reinforcement Learning

  • Xudong Gong
  • Dawei Feng
  • Kele Xu
  • Bo Ding
  • Huaimin Wang

Existing Goal-Conditioned Reinforcement Learning (GCRL) algorithms are built upon Hindsight Experience Replay (HER), which densifies rewards through hindsight replay and leverages historical goal-achieving information to construct a learning curriculum. However, when the task is characterized by a non-Markovian reward (NMR), whose computation depends on multiple steps of states and actions, HER can no longer densify rewards by treating a single encountered state as the hindsight goal. The lack of informative rewards hinders policy learning, resulting in failed rollout trajectories. Consequently, the replay buffer is overwhelmed with failed trajectories, impeding the establishment of an applicable curriculum. To circumvent these limitations, we deviate from existing HER-based methods and propose an on-policy GCRL framework, GCPO, which is applicable to both multi-goal Markovian reward (MR) and NMR problems. GCPO consists of (1) Pre-training from Demonstrations, which pre-trains the policy to an initial goal-achieving capability, thereby diminishing the difficulty of subsequent online learning; and (2) Online Self-Curriculum Learning, which first estimates the policy's goal-achieving capability from historical evaluation information and then selects progressively challenging goals matched to that capability. We evaluate GCPO on a challenging multi-goal long-horizon task: fixed-wing UAV velocity vector control. Experimental results demonstrate that GCPO effectively addresses both multi-goal MR and NMR problems.
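
The self-curriculum component admits a small sketch. The rule below is one interpretation, not the authors' exact criterion: estimate per-goal success rates from evaluation history and pick the goal closest to an assumed target success rate, so difficulty rises as the policy improves.

```python
# Interpretation of online self-curriculum goal selection (not GCPO's exact rule).
# `target_success` is an assumed knob for the desired difficulty frontier.
from collections import defaultdict

class SelfCurriculum:
    def __init__(self, goals, target_success: float = 0.5):
        self.goals = list(goals)
        self.target = target_success
        self.history = defaultdict(list)  # goal -> list of 0/1 outcomes

    def record(self, goal, success: bool):
        self.history[goal].append(int(success))

    def _success_rate(self, goal) -> float:
        h = self.history[goal]
        return sum(h) / len(h) if h else 0.0  # unseen goals count as hard

    def sample_goal(self):
        # Prefer the goal whose estimated success rate is nearest the target.
        return min(self.goals, key=lambda g: abs(self._success_rate(g) - self.target))
```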

ICML 2024 · Conference Paper

Iterative Regularized Policy Optimization with Imperfect Demonstrations

  • Xudong Gong
  • Dawei Feng
  • Kele Xu
  • Yuanzhao Zhai
  • Chengkang Yao
  • Weijia Wang
  • Bo Ding 0001
  • Huaimin Wang 0001

Imitation learning heavily relies on the quality of the provided demonstrations. In scenarios where demonstrations are imperfect and rare, a prevalent approach for refining policies is online fine-tuning with reinforcement learning, in which a Kullback–Leibler (KL) regularization is often employed to stabilize the learning process. However, our investigation reveals that, on the one hand, imperfect demonstrations can bias the online learning process, while on the other, the KL regularization further constrains the improvement of online policy exploration. To address these issues, we propose Iterative Regularized Policy Optimization (IRPO), a framework that alternates offline imitation learning with online reinforcement exploration. Specifically, the policy learned online serves as the demonstrator for successive learning iterations, with demonstration boosting consistently enhancing the quality of the demonstrations. Experimental validations conducted across widely used benchmarks and a novel fixed-wing UAV control task consistently demonstrate the effectiveness of IRPO in improving both demonstration quality and policy performance. Our code is available at https://github.com/GongXudong/IRPO.
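
The outer loop described in the abstract can be sketched as a skeleton. The four hooks below are hypothetical stand-ins, not the authors' interfaces; they only fix the order of the stages.

```python
# Skeleton of the IRPO-style outer loop as read from the abstract.
# All four hooks are assumed stand-ins, not the authors' code.
def irpo(demos, imitate, finetune_online, rollout, n_iterations: int = 3):
    """
    demos            -- initial (possibly imperfect) demonstration set
    imitate          -- offline imitation learning: demos -> policy
    finetune_online  -- online RL, KL-regularized toward a reference policy
    rollout          -- collect fresh trajectories from a policy
    """
    policy = None
    for _ in range(n_iterations):
        ref = imitate(demos)           # offline stage on current demonstrations
        policy = finetune_online(ref)  # online stage, KL-anchored to ref
        demos = rollout(policy)        # demonstration boosting: the improved
                                       # policy becomes the next demonstrator
    return policy
```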

AAAI 2024 · Conference Paper

Optimistic Model Rollouts for Pessimistic Offline Policy Optimization

  • Yuanzhao Zhai
  • Yiying Li
  • Zijian Gao
  • Xudong Gong
  • Kele Xu
  • Dawei Feng
  • Bo Ding
  • Huaimin Wang

Model-based offline reinforcement learning (RL) has made remarkable progress, offering a promising avenue for improving generalization with synthetic model rollouts. Existing works primarily focus on incorporating pessimism into policy optimization, usually via constructing a Pessimistic Markov Decision Process (P-MDP). However, the P-MDP discourages policies from learning in out-of-distribution (OOD) regions beyond the support of offline datasets, which can under-utilize the generalization ability of dynamics models. In contrast, we propose constructing an Optimistic MDP (O-MDP). We first observe the potential benefits of optimism brought by encouraging more OOD rollouts. Motivated by this observation, we present ORPO, a simple yet effective model-based offline RL framework that generates Optimistic model Rollouts for Pessimistic offline policy Optimization. Specifically, we train an optimistic rollout policy in the O-MDP to sample more OOD model rollouts. We then relabel the sampled state-action pairs with penalized rewards and optimize the output policy in the P-MDP. Theoretically, we demonstrate that the performance of policies trained with ORPO can be lower-bounded in linear MDPs. Experimental results show that our framework significantly outperforms P-MDP baselines by a margin of 30%, achieving state-of-the-art performance on the widely used benchmark. Moreover, ORPO exhibits notable advantages in problems that require generalization.
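
The two reward labelings at the heart of the O-MDP/P-MDP split can be written down compactly. This is an interpretation of the abstract, not the authors' code; `uncertainty` stands in for an assumed estimate of model error, e.g. ensemble disagreement.

```python
# Interpretation of ORPO's two reward labelings (not the authors' code).
# `uncertainty` is an assumed per-transition model-error estimate.
import numpy as np

def optimistic_reward(r_model: np.ndarray, uncertainty: np.ndarray,
                      beta: float = 1.0) -> np.ndarray:
    # O-MDP: a bonus encourages the rollout policy to visit OOD regions.
    return r_model + beta * uncertainty

def pessimistic_relabel(r_model: np.ndarray, uncertainty: np.ndarray,
                        lam: float = 1.0) -> np.ndarray:
    # P-MDP: a penalty relabels the same rollouts before optimizing
    # the output policy.
    return r_model - lam * uncertainty
```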

IJCAI 2019 · Conference Paper

A Quantitative Analysis Platform for PD-L1 Immunohistochemistry based on Point-level Supervision Model

  • Haibo Mi
  • Kele Xu
  • Yang Xiang
  • Yulin He
  • Dawei Feng
  • Huaimin Wang
  • Chun Wu
  • Yanming Song

Recently, deep learning has witnessed dramatic progress in the medical image analysis field. In the precise treatment of cancer immunotherapy, the quantitative analysis of PD-L1 immunohistochemistry is of great importance. Pathologists commonly quantify cell nuclei manually, a process that is time-consuming and error-prone. In this paper, we describe the development of a platform for quantitative analysis of PD-L1 pathological images using deep learning approaches. As point-level annotations can provide a rough estimate of object locations and classifications, the platform adopts a point-level supervision model to classify, localize, and count PD-L1 cell nuclei. Presently, the platform achieves accurate quantitative analysis of PD-L1 for two types of carcinoma and is deployed in one of the first-class hospitals in China.
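
As a toy illustration of counting from point-level predictions (not the platform's model), one can treat local maxima of a per-pixel nucleus probability map as detected nuclei and count them:

```python
# Toy sketch of peak-based counting from a probability map (an illustration,
# not the platform's actual point-level supervision model).
import numpy as np
from scipy.ndimage import maximum_filter

def count_nuclei(prob_map: np.ndarray, threshold: float = 0.5,
                 window: int = 5) -> int:
    """Count local maxima of the probability map above `threshold`."""
    local_max = maximum_filter(prob_map, size=window) == prob_map
    peaks = local_max & (prob_map > threshold)
    return int(peaks.sum())
```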

TAAS 2015 · Journal Article

Fault Monitoring with Sequential Matrix Factorization

  • Dawei Feng
  • Cecile Germain

For real-world distributed systems, the knowledge component at the core of the MAPE-K loop has to be inferred, as it cannot realistically be assumed to be defined a priori. Accordingly, this paper frames fault monitoring as a latent factor discovery problem. In the context of end-to-end probing, the goal is to devise an efficient sampling policy that makes the best use of a constrained sampling budget. Previous work addresses fault monitoring in a collaborative prediction framework, where the information is a snapshot of the probe outcomes. Here, we take into account the fact that the system dynamically evolves at various time scales. We propose and evaluate Sequential Matrix Factorization (SMF), which exploits both recent advances in matrix factorization for the instantaneous information and a new sampling heuristic based on historical information. The effectiveness of SMF is exemplified on datasets of increasing difficulty and compared with state-of-the-art history-based and snapshot-based methods. In all cases, strong adaptivity, in the specific flavor of active learning, is required to unleash the full potential of coupling the most-confident and most-uncertain sampling heuristics, which is the cornerstone of SMF.
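
The coupled sampling heuristic can be sketched concretely. The following is one interpretation, not the paper's algorithm: given low-rank factors predicting binary probe outcomes, spend part of the probing budget on the most confident unobserved entries and the rest on the most uncertain ones. `confident_frac` is an assumed knob.

```python
# Interpretation of SMF's coupled sampling heuristic (not the paper's code).
# U: n_sources x k, V: n_targets x k; returns flat indices to probe next.
import numpy as np

def select_probes(U: np.ndarray, V: np.ndarray, observed_mask: np.ndarray,
                  budget: int, confident_frac: float = 0.5) -> np.ndarray:
    pred = 1.0 / (1.0 + np.exp(-(U @ V.T)))  # predicted success probabilities
    certainty = np.abs(pred - 0.5)           # distance from the decision boundary
    certainty = certainty.astype(float)
    certainty[observed_mask] = -np.inf       # never re-probe observed pairs
    flat = certainty.ravel()
    order = np.argsort(flat)                 # ascending: most uncertain first
    valid = order[np.isfinite(flat[order])]
    n_conf = int(budget * confident_frac)
    confident = valid[-n_conf:] if n_conf else np.array([], dtype=int)
    uncertain = valid[: budget - n_conf]
    return np.concatenate([uncertain, confident])
```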