Author name cluster

Huaimin Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

10 papers

1 author row

AAAI Conference 2025 Conference Paper

Enhancing Decision-Making for LLM Agents via Step-Level Q-Value Models

Yuanzhao Zhai
Tingkai Yang
Kele Xu
Dawei Feng
Cheng Yang
Bo Ding
Huaimin Wang

Agents significantly enhance the capabilities of standalone Large Language Models (LLMs) by perceiving environments, making decisions, and executing actions. However, LLM agents still face challenges in tasks that require multiple decision-making steps. Estimating the value of actions in specific tasks is difficult when intermediate actions are neither appropriately rewarded nor penalized. In this paper, we propose leveraging a task-relevant Q-value model to guide action selection. Specifically, we first collect decision-making trajectories annotated with step-level Q values via Monte Carlo Tree Search (MCTS) and construct preference data. We then use another LLM to fit these preferences through step-level Direct Policy Optimization (DPO), which serves as the Q-value model. During inference, at each decision-making step, LLM agents select the action with the highest Q value before interacting with the environment. We apply our method to various open-source and API-based LLM agents, demonstrating that Q-value models significantly improve their performance. Notably, the performance of the agent built with Phi-3-mini-4k-instruct improved by 103% on WebShop and 75% on HotPotQA when enhanced with Q-value models, even surpassing GPT-4o-mini. Additionally, Q-value models offer several advantages, such as generalization to different LLM agents and seamless integration with existing prompting strategies.

PDF Details DOI

AAAI Conference 2025 Conference Paper

Maintaining Fairness in Logit-based Knowledge Distillation for Class-Incremental Learning

Zijian Gao
Shanhao Han
Xingxing Zhang
Kele Xu
Dulan Zhou
Xinjun Mao
Yong Dou
Huaimin Wang

Logit-based knowledge distillation (KD) is commonly used to mitigate catastrophic forgetting in class-incremental learning (CIL) caused by data distribution shifts. However, the strict match of logit values between student and teacher models conflicts with the cross-entropy (CE) loss objective of learning new classes, leading to significant recency bias (i.e. unfairness). To address this issue, we rethink the overlooked limitations of KD-based methods through empirical analysis. Inspired by our findings, we introduce a plug-and-play pre-process method that normalizes the logits of both the student and teacher across all classes, rather than just the old classes, before distillation. This approach allows the student to focus on both old and new classes, capturing intrinsic inter-class relations from the teacher. By doing so, our method avoids the inherent conflict between KD and CE, maintaining fairness between old and new classes. Additionally, recognizing that overconfident teacher predictions can hinder the transfer of inter-class relations (i.e., dark knowledge), we extend our method to capture intra-class relations among different instances, ensuring fairness within old classes. Our method integrates seamlessly with existing logit-based KD approaches, consistently enhancing their performance across multiple CIL benchmarks without incurring additional training costs.

PDF Details DOI

NeurIPS Conference 2024 Conference Paper

Goal-Conditioned On-Policy Reinforcement Learning

Xudong Gong
Dawei Feng
Kele Xu
Bo Ding
Huaimin Wang

Existing Goal-Conditioned Reinforcement Learning (GCRL) algorithms are built upon Hindsight Experience Replay (HER), which densifies rewards through hindsight replay and leverages historical goal-achieving information to construct a learning curriculum. However, when the task is characterized by a non-Markovian reward (NMR), whose computation depends on multiple steps of states and actions, HER can no longer densify rewards by treating a single encountered state as the hindsight goal. The lack of informative rewards hinders policy learning, resulting in rolling out failed trajectories. Consequently, the replay buffer is overwhelmed with failed trajectories, impeding the establishment of an applicable curriculum. To circumvent these limitations, we deviate from existing HER-based methods and propose an on-policy GCRL framework, GCPO, which is applicable to both multi-goal Markovian reward (MR) and NMR problems. GCPO consists of (1) Pre-training from Demonstrations, which pre-trains the policy to possess an initial goal-achieving capability, thereby diminishing the difficulty of subsequent online learning. (2) Online Self-Curriculum Learning, which first estimates the policy's goal-achieving capability based on historical evaluation information and then selects progressively challenging goals for learning based on its current capability. We evaluate GCPO on a challenging multi-goal long-horizon task: fixed-wing UAV velocity vector control. Experimental results demonstrate that GCPO is capable of effectively addressing both multi-goal MR and NMR problems.

PDF Details DOI

AAAI Conference 2024 Conference Paper

Optimistic Model Rollouts for Pessimistic Offline Policy Optimization

Yuanzhao Zhai
Yiying Li
Zijian Gao
Xudong Gong
Kele Xu
Dawei Feng
Ding Bo
Huaimin Wang

Model-based offline reinforcement learning (RL) has made remarkable progress, offering a promising avenue for improving generalization with synthetic model rollouts. Existing works primarily focus on incorporating pessimism for policy optimization, usually via constructing a Pessimistic Markov Decision Process (P-MDP). However, the P-MDP discourages the policies from learning in out-of-distribution (OOD) regions beyond the support of offline datasets, which can under-utilize the generalization ability of dynamics models. In contrast, we propose constructing an Optimistic MDP (O-MDP). We initially observed the potential benefits of optimism brought by encouraging more OOD rollouts. Motivated by this observation, we present ORPO, a simple yet effective model-based offline RL framework. ORPO generates Optimistic model Rollouts for Pessimistic offline policy Optimization. Specifically, we train an optimistic rollout policy in the O-MDP to sample more OOD model rollouts. Then we relabel the sampled state-action pairs with penalized rewards, and optimize the output policy in the P-MDP. Theoretically, we demonstrate that the performance of policies trained with ORPO can be lower-bounded in linear MDPs. Experimental results show that our framework significantly outperforms P-MDP baselines by a margin of 30%, achieving state-of-the-art performance on the widely-used benchmark. Moreover, ORPO exhibits notable advantages in problems that require generalization.

PDF Details DOI

NeurIPS Conference 2024 Conference Paper

Stabilizing Zero-Shot Prediction: A Novel Antidote to Forgetting in Continual Vision-Language Tasks

Zijian Gao
Xingxing Zhang
Kele Xu
Xinjun Mao
Huaimin Wang

Continual learning (CL) empowers pre-trained vision-language (VL) models to efficiently adapt to a sequence of downstream tasks. However, these models often encounter challenges in retaining previously acquired skills due to parameter shifts and limited access to historical data. In response, recent efforts focus on devising specific frameworks and various replay strategies, striving for a typical learning-forgetting trade-off. Surprisingly, both our empirical research and theoretical analysis demonstrate that the stability of the model in consecutive zero-shot predictions serves as a reliable indicator of its anti-forgetting capabilities for previously learned tasks. Motivated by these insights, we develop a novel replay-free CL method named ZAF (Zero-shot Antidote to Forgetting), which preserves acquired knowledge through a zero-shot stability regularization applied to wild data in a plug-and-play manner. To enhance efficiency in adapting to new tasks and seamlessly access historical models, we introduce a parameter-efficient EMA-LoRA neural architecture based on the Exponential Moving Average (EMA). ZAF utilizes new data for low-rank adaptation (LoRA), complemented by a zero-shot antidote on wild data, effectively decoupling learning from forgetting. Our extensive experiments demonstrate ZAF's superior performance and robustness in pre-trained models across various continual VL concept learning tasks, achieving leads of up to 3. 70\%, 4. 82\%, and 4. 38\%, along with at least a 10x acceleration in training speed on three benchmarks, respectively. Additionally, our zero-shot antidote significantly reduces forgetting in existing models by at least 6. 37\%. Our code is available at https: //github. com/Zi-Jian-Gao/Stabilizing-Zero-Shot-Prediction-ZAF.

PDF Details DOI

IJCAI Conference 2022 Conference Paper

Unsupervised Voice-Face Representation Learning by Cross-Modal Prototype Contrast

Boqing Zhu
Kele Xu
Changjian Wang
Zheng Qin
Tao Sun
Huaimin Wang
Yuxing Peng

We present an approach to learn voice-face representations from the talking face videos, without any identity labels. Previous works employ cross-modal instance discrimination tasks to establish the correlation of voice and face. These methods neglect the semantic content of different videos, introducing false-negative pairs as training noise. Furthermore, the positive pairs are constructed based on the natural correlation between audio clips and visual frames. However, this correlation might be weak or inaccurate in a large amount of real-world data, which leads to deviating positives into the contrastive paradigm. To address these issues, we propose the cross-modal prototype contrastive learning (CMPC), which takes advantage of contrastive methods and resists adverse effects of false negatives and deviate positives. On one hand, CMPC could learn the intra-class invariance by constructing semantic-wise positives via unsupervised clustering in different modalities. On the other hand, by comparing the similarities of cross-modal instances from that of cross-modal prototypes, we dynamically recalibrate the unlearnable instances' contribution to overall loss. Experiments show that the proposed approach outperforms state-of-the-art unsupervised methods on various voice-face association evaluation protocols. Additionally, in the low-shot supervision setting, our method also has a significant improvement compared to previous instance-wise contrastive learning.

PDF Details DOI

NeurIPS Conference 2020 Conference Paper

Online Meta-Critic Learning for Off-Policy Actor-Critic Methods

Wei Zhou
Yiying Li
Yongxin Yang
Huaimin Wang
Timothy Hospedales

Off-Policy Actor-Critic (OffP-AC) methods have proven successful in a variety of continuous control tasks. Normally, the critic's action-value function is updated using temporal-difference, and the critic in turn provides a loss for the actor that trains it to take actions with higher expected return. In this paper, we introduce a flexible and augmented meta-critic that observes the learning process and meta-learns an additional loss for the actor that accelerates and improves actor-critic learning. Compared to existing meta-learning algorithms, meta-critic is rapidly learned online for a single task, rather than slowly over a family of tasks. Crucially, our meta-critic is designed for off-policy based learners, which currently provide state-of-the-art reinforcement learning sample efficiency. We demonstrate that online meta-critic learning benefits to a variety of continuous control tasks when combined with contemporary OffP-AC methods DDPG, TD3 and SAC.

PDF Details

IJCAI Conference 2019 Conference Paper

A Mobile Application for Sound Event Detection

Yingwei Fu
Kele Xu
Haibo Mi
Huaimin Wang
Dezhi Wang
Boqing Zhu

Sound event detection is intended to analyze and recognize the sound events in audio streams and it has widespread applications in real life. Recently, deep neural networks such as convolutional recurrent neural networks have shown state-of-the-art performance in this task. However, the previous methods were designed and implemented on devices with rich computing resources, and there are few applications on mobile devices. This paper focuses on the solution on the mobile platform for sound event detection. The architecture of the solution includes offline training and online detection. During offline training process, multi model-based distillation method is used to compress model to enable real-time detection. The online detection process includes acquisition of sensor data, processing of audio signals, and detecting and recording of sound events. Finally, we implement an application on the mobile device that can detect sound events in near real time.

PDF Details

IJCAI Conference 2019 Conference Paper

A Quantitative Analysis Platform for PD-L1 Immunohistochemistry based on Point-level Supervision Model

Haibo Mi
Kele Xu
Yang Xiang
Yulin He
Dawei Feng
Huaimin Wang
Chun Wu
Yanming Song

Recently, deep learning has witnessed dramatic progress in the medical image analysis field. In the precise treatment of cancer immunotherapy, the quantitative analysis of PD-L1 immunohistochemistry is of great importance. It is quite common that pathologists manually quantify the cell nuclei. This process is very time-consuming and error-prone. In this paper, we describe the development of a platform for PD-L1 pathological image quantitative analysis using deep learning approaches. As point-level annotations can provide a rough estimate of the object locations and classifications, this platform adopts a point-level supervision model to classify, localize, and count the PD-L1 cells nuclei. Presently, this platform has achieved an accurate quantitative analysis of PD-L1 for two types of carcinoma, and it is deployed in one of the first-class hospitals in China.

PDF Details

IJCAI Conference 2015 Conference Paper

Data Sparseness in Linear SVM

Xiang Li
Huaimin Wang
Bin Gu
Charles X. Ling

Large sparse datasets are common in many realworld applications. Linear SVM has been shown to be very efficient for classifying such datasets. However, it is still unknown how data sparseness would affect its convergence behavior. To study this problem in a systematic manner, we propose a novel approach to generate large and sparse data from real-world datasets, using statistical inference and the data sampling process in the PAC framework. We first study the convergence behavior of linear SVM experimentally, and make several observations, useful for real-world applications. We then offer theoretical proofs for our observations by studying the Bayes risk and PAC bound. Our experiment and theoretic results are valuable for learning large sparse datasets with linear SVM.

PDF Details