Arrow Research search

Author name cluster

Yuhui Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

9 papers
2 author rows

Possible papers

AAAI Conference 2026 Conference Paper

MetaAct-RL: Training Language Models for Reasoning Through Meta-Action-Based Reinforcement Learning

  • Zhiheng Xi
  • Yuhui Wang
  • Yiwen Ding
  • Guanyu Li
  • Senjie Jin
  • Shichun Liu
  • Jixuan Huang
  • Dingwen Yang

Outcome-based reinforcement learning has made notable advances in training language models (LMs) for reasoning. However, without explicit incentives and controls, this paradigm is limited and unstable in eliciting high-quality reasoning trajectories with diverse actions, particularly for models whose pretraining lacked extensive reasoning-related data. To this end, we introduce MetaAct-RL, a new RL framework that frames LMs’ thinking as sequential decision making over meta-actions. In this framework, the model chooses and executes a high-level action at each step (such as forward reasoning, critique, or refinement) to gradually reach the correct answer. To encourage deeper exploration and richer action diversity, and to improve sampling efficiency in the RL optimization process, MetaAct-RL incorporates appropriate length-based reward and regularization, and a key-state restart mechanism. Extensive experiments across six benchmarks show that MetaAct-RL improves reasoning performance by 7.99 on Llama3.2-1B and 7.17 on Llama3.1-8B relative to the vanilla RL method. Moreover, on the challenging AIME-2024, our method outperforms vanilla RL by 7.5 with Qwen2.5-1.5B.

ECAI Conference 2024 Conference Paper

Few-Shot Object Detection with Instance Feature Generation and Hybrid Contrastive Learning

  • Yuhui Wang
  • Bo Peng 0007
  • Tianyi Qin
  • Jiahui Song
  • Xu Zhang 0045

Few-shot object detection aims to effectively detect novel classes with limited annotated samples. Due to the low-quality instance features obtained by deep learning in data-scarce scenarios, few-shot object detection remains a significant challenge. In this paper, a novel method is proposed to enhance the few-shot object detection performance by focusing on the generalizability and discriminability of instance features. In detail, by synthesizing auxiliary instances that embed diverse attributes, a mask-guided instance feature generation module is presented to alleviate the overfitting on sample-specific characteristics, thereby facilitating the acquisition of generalizable object-relevant knowledge. Then, to learn discriminative object-relevant attributes, a hybrid cross-layer and intra-layer contrastive learning mechanism is designed to enhance the discriminability of instance features by building contrastive constraints between instances within and across layers. Experimental results on two widely used benchmarks demonstrate the effectiveness of the proposed method.

EWRL Workshop 2024 Workshop Paper

Scaling Value Iteration Networks to 5000 Layers for Extreme Long-Term Planning

  • Yuhui Wang
  • Qingyuan Wu
  • Weida Li
  • Dylan R. Ashley
  • Francesco Faccio
  • Chao Huang
  • Jürgen Schmidhuber

The Value Iteration Network (VIN) is an end-to-end differentiable architecture that performs value iteration on a latent MDP for planning in reinforcement learning (RL). However, VINs struggle to scale to long-term and large-scale planning tasks, such as navigating a $100\times 100$ maze, a task that typically requires thousands of planning steps to solve. We observe that this deficiency is due to two issues: the representation capacity of the latent MDP and the planning module's depth. We address these by augmenting the latent MDP with a dynamic transition kernel, dramatically improving its representational capacity, and, to mitigate the vanishing gradient problem, introducing an "adaptive highway loss" that constructs skip connections to improve gradient flow. We evaluate our method on both 2D maze navigation environments and the ViZDoom 3D navigation benchmark. We find that our new method, named Dynamic Transition VIN (DT-VIN), easily scales to 5000 layers and casually solves challenging versions of the above tasks. Altogether, we believe that DT-VIN represents a concrete step forward in performing long-term large-scale planning in RL environments.
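To illustrate why planning depth matters in the abstract above, here is a minimal sketch of plain tabular value iteration on a one-dimensional corridor. It is not the VIN or DT-VIN architecture; the function name, the corridor environment, and the reward scheme are assumptions made for illustration. Each sweep propagates value information by one cell, so a start state that is far from the goal needs correspondingly many iterations before its value becomes informative.

```python
def value_iteration_corridor(length, iterations, gamma=0.99):
    """Run synchronous value iteration on a 1-D corridor (hypothetical toy task).

    States are 0..length-1, the goal is the rightmost state and is terminal.
    Actions move one cell left or right; entering the goal yields reward 1.
    """
    values = [0.0] * length
    for _ in range(iterations):
        new_values = values[:]
        for s in range(length - 1):  # the terminal goal keeps value 0
            best = float("-inf")
            for nxt in (max(s - 1, 0), s + 1):
                reward = 1.0 if nxt == length - 1 else 0.0
                best = max(best, reward + gamma * values[nxt])
            new_values[s] = best
        values = new_values
    return values


shallow = value_iteration_corridor(length=5, iterations=2)
deep = value_iteration_corridor(length=5, iterations=4)
print(shallow[0], deep[0])  # the start state only gets a nonzero value with enough sweeps
```

In this toy setting the number of sweeps plays the role that the number of planning layers plays in a VIN-style module.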

NeurIPS Conference 2024 Conference Paper

Variational Delayed Policy Optimization

  • Qingyuan Wu
  • Simon S. Zhan
  • Yixuan Wang
  • Yuhui Wang
  • Chung-Wei Lin
  • Chen Lv
  • Qi Zhu
  • Chao Huang

In environments with delayed observations, state augmentation (including the actions taken within the delay window in the state) is adopted to restore the Markov property and enable reinforcement learning (RL). However, state-of-the-art (SOTA) RL techniques based on Temporal-Difference (TD) learning commonly suffer from learning inefficiency, because the augmented state space expands significantly with the delay. To improve learning efficiency without sacrificing performance, this work introduces Variational Delayed Policy Optimization (VDPO), which reformulates delayed RL as a variational inference problem. This problem is further modelled as a two-step iterative optimization problem, where the first step is TD learning in the delay-free environment with a small state space, and the second step is behaviour cloning, which can be addressed much more efficiently than TD learning. We not only provide a theoretical analysis of VDPO in terms of sample complexity and performance, but also empirically demonstrate that VDPO can achieve performance consistent with SOTA methods while significantly improving sample efficiency (approximately 50% fewer samples) on the MuJoCo benchmark.
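The state augmentation described in the abstract above can be sketched as follows. This is a minimal illustration, not code from the VDPO paper; the class name `DelayAugmenter` and its interface are hypothetical. The augmented state pairs the most recent (delayed) observation with the actions issued since that observation, which is why the augmented space grows with the delay: with |A| discrete actions and a delay of d steps it has up to |S| x |A|^d elements.

```python
from collections import deque


class DelayAugmenter:
    """Track the augmented state (delayed observation, actions in the delay window)."""

    def __init__(self, delay, initial_obs, default_action):
        self.obs = initial_obs
        # Before any real action exists, the window is padded with a default action.
        self.actions = deque([default_action] * delay, maxlen=delay)

    def step(self, action, delayed_obs):
        """Record the newest action and the observation that just arrived."""
        self.actions.append(action)  # the oldest action falls out of the window
        self.obs = delayed_obs
        return (self.obs, tuple(self.actions))


augmenter = DelayAugmenter(delay=2, initial_obs="s0", default_action=0)
print(augmenter.step(action=1, delayed_obs="s1"))
print(augmenter.step(action=2, delayed_obs="s2"))
```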

EAAI Journal 2023 Journal Article

Adaptive scalable spatio-temporal graph convolutional network for PM2.5 prediction

  • Qingjian Ni
  • Yuhui Wang
  • Jiayi Yuan

PM2.5 prediction is a complex task of large-scale spatio-temporal analysis, which requires not only comprehension of static geospatial knowledge and related features but also analysis of the real-time situation. This paper discusses the characteristics of the static graph and the dynamic graph in spatio-temporal series tasks. An Adaptive Scalable Spatio-temporal Graph Convolutional Network (ASGCN) model is proposed to predict PM2.5. To capture and analyze the periodic characteristics of the PM2.5 time series, a temporal convolution network based on inception and gating strategies is proposed and used as the temporal module. A dynamic graph is adopted to distinguish the spatio-temporal similarity of different periods. An adaptive weighted multilayer graph convolution network is used to process the static and dynamic graphs, aiming to analyze the spatial relationships among PM2.5 stations. The convolution network with inception and gating improves the ability to capture time-series features, and the adaptive static and dynamic graphs enhance the ability to analyze spatial relationships. The temporal and spatial modules of the model are relatively independent, which helps extract the latent information in the datasets and improves prediction accuracy. At the same time, these modules cooperate to make the model adaptable to various data. We compare against a large number of models and design a thorough experimental scheme including single-step prediction, multi-step prediction, hyperparameter experiments, and ablation experiments on two real PM2.5 datasets collected in China. Finally, the model achieves performance close to or better than the current state-of-the-art models selected for comparison in prediction tasks.

ICLR Conference 2023 Conference Paper

DropIT: Dropping Intermediate Tensors for Memory-Efficient DNN Training

  • Joya Chen
  • Kai Xu
  • Yuhui Wang
  • Yifei Cheng 0002
  • Angela Yao

A standard hardware bottleneck when training deep neural networks is GPU memory. The bulk of memory is occupied by caching intermediate tensors for gradient computation in the backward pass. We propose a novel method to reduce this footprint: Dropping Intermediate Tensors (DropIT). DropIT drops min-k elements of the intermediate tensors and approximates gradients from the sparsified tensors in the backward pass. Theoretically, DropIT reduces noise on estimated gradients and therefore has a higher rate of convergence than vanilla-SGD. Experiments show that we can drop up to 90% of the intermediate tensor elements in fully-connected and convolutional layers while achieving higher testing accuracy for Visual Transformers and Convolutional Neural Networks on various tasks (e.g., classification, object detection, instance segmentation). Our code and models are available at https://github.com/chenjoya/dropit.
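As a rough illustration of the "drop min-k elements" step mentioned in the abstract above, the sketch below zeroes out the smallest entries of a cached tensor, here represented as a flat Python list. It is not taken from the DropIT repository; the function name `drop_min_k`, the choice to rank elements by absolute value, and the `drop_fraction` parameter are assumptions made for illustration.

```python
def drop_min_k(values, drop_fraction):
    """Zero the smallest-magnitude entries of a flat list (illustrative assumption).

    drop_fraction is the share of elements to drop, e.g. 0.9 keeps only the
    largest 10% of entries by absolute value.
    """
    k = int(len(values) * drop_fraction)
    if k == 0:
        return list(values)
    # Indices of the k entries with the smallest absolute value.
    by_magnitude = sorted(range(len(values)), key=lambda i: abs(values[i]))
    dropped = set(by_magnitude[:k])
    return [0.0 if i in dropped else v for i, v in enumerate(values)]


cached_activation = [0.1, -3.0, 0.5, 2.0]
print(drop_min_k(cached_activation, drop_fraction=0.5))
```

In a real training loop the sparsified tensor would be stored in a compact format to actually save memory, and the backward pass would compute gradients from it; both steps are omitted here.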

AAAI Conference 2021 Conference Paper

Deep Recurrent Belief Propagation Network for POMDPs

  • Yuhui Wang
  • Xiaoyang Tan

In many real-world sequential decision-making tasks, especially in continuous control such as robotic control, observations are rarely perfect: the sensory data may be incomplete, noisy, or even dynamically corrupted due to unexpected malfunctions or the intrinsically low quality of the sensors. Previous methods handle these issues in the framework of POMDPs and are either deterministic, relying on feature memorization, or stochastic, relying on belief inference. In this paper, we present a new method that lies in the middle of this spectrum and combines the strengths of both approaches. In particular, the proposed method, named Deep Recurrent Belief Propagation Network (DRBPN), adopts a hybrid-style belief updating procedure: an RNN-type feature extraction step followed by an analytical belief inference, significantly reducing the computational cost while faithfully capturing the complex dynamics and maintaining the necessary uncertainty for generalization. The effectiveness of the proposed method is verified on a collection of benchmark tasks, showing that our approach outperforms several state-of-the-art methods under various challenging scenarios.

AAAI Conference 2020 Conference Paper

SMIX(λ): Enhancing Centralized Value Functions for Cooperative Multi-Agent Reinforcement Learning

  • Chao Wen
  • Xinghu Yao
  • Yuhui Wang
  • Xiaoyang Tan

This work presents a sample-efficient and effective value-based method, named SMIX(λ), for reinforcement learning in multi-agent environments (MARL) within the paradigm of centralized training with decentralized execution (CTDE), in which learning a stable and generalizable centralized value function (CVF) is crucial. To achieve this, our method carefully combines different elements, including 1) removing the unrealistic centralized greedy assumption during the learning phase, 2) using the λ-return to balance the trade-off between bias and variance and to deal with the environment’s non-Markovian property, and 3) adopting an experience-replay style off-policy training. Interestingly, it is revealed that there exists an inherent connection between SMIX(λ) and the previous off-policy Q(λ) approach for single-agent learning. Experiments on the StarCraft Multi-Agent Challenge (SMAC) benchmark show that the proposed SMIX(λ) algorithm outperforms several state-of-the-art MARL methods by a large margin, and that it can be used as a general tool to improve the overall performance of a CTDE-type method by enhancing the evaluation quality of its CVF. We open-source our code at: https://github.com/chaovven/SMIX.
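The λ-return mentioned in the abstract above interpolates between one-step TD targets (λ = 0, low variance, more bias) and Monte Carlo returns (λ = 1, high variance, less bias). The sketch below computes it for a single trajectory using the standard backward recursion; it is a generic single-agent illustration, not the SMIX(λ) implementation, and the function and parameter names are assumptions.

```python
def lambda_returns(rewards, next_values, gamma=0.99, lam=0.8):
    """Compute lambda-returns for one trajectory.

    rewards[t] is the reward received at step t, and next_values[t] is the
    value estimate of the state reached after step t (0 if that state is terminal).
    Recursion: G_t = r_t + gamma * ((1 - lam) * V(s_{t+1}) + lam * G_{t+1}),
    bootstrapping from the final value estimate at the end of the trajectory.
    """
    returns = [0.0] * len(rewards)
    next_return = next_values[-1]
    for t in reversed(range(len(rewards))):
        returns[t] = rewards[t] + gamma * (
            (1.0 - lam) * next_values[t] + lam * next_return
        )
        next_return = returns[t]
    return returns


# lam=0 reduces to one-step TD targets; lam=1 with a terminal final state
# reduces to the discounted Monte Carlo return.
print(lambda_returns([1.0, 2.0], [10.0, 20.0], gamma=0.5, lam=0.0))
print(lambda_returns([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], gamma=0.5, lam=1.0))
```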

NeurIPS Conference 2019 Conference Paper

Trust Region-Guided Proximal Policy Optimization

  • Yuhui Wang
  • Hao He
  • Xiaoyang Tan
  • Yaozhong Gan

Proximal policy optimization (PPO) is one of the most popular deep reinforcement learning (RL) methods, achieving state-of-the-art performance across a wide range of challenging tasks. However, as a model-free RL method, the success of PPO relies heavily on the effectiveness of its exploratory policy search. In this paper, we give an in-depth analysis of the exploration behavior of PPO, and show that PPO is prone to insufficient exploration, especially under bad initialization, which may lead to training failure or to being trapped in bad local optima. To address these issues, we propose a novel policy optimization method, named Trust Region-Guided PPO (TRGPPO), which adaptively adjusts the clipping range within the trust region. We formally show that this method not only improves the exploration ability within the trust region but also enjoys a better performance bound than the original PPO. Extensive experiments verify the advantage of the proposed method.
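For context on the clipping range discussed in the abstract above, the sketch below shows the per-sample clipped surrogate objective used by PPO, written so that the lower and upper clipping bounds are explicit parameters. Standard PPO uses the fixed bounds 1 - ε and 1 + ε. How TRGPPO derives its adaptive bounds is not given in the abstract, so no such rule is shown here; the function name and signature are assumptions for illustration.

```python
def clipped_surrogate(ratio, advantage, low, high):
    """Per-sample PPO clipped surrogate objective.

    ratio is pi_new(a|s) / pi_old(a|s); low and high are the clipping bounds
    (1 - eps and 1 + eps in standard PPO).
    """
    clipped_ratio = max(low, min(ratio, high))
    # Taking the minimum makes the objective a pessimistic bound: the policy
    # gains nothing from pushing the ratio beyond the clipping range.
    return min(ratio * advantage, clipped_ratio * advantage)


eps = 0.2
print(clipped_surrogate(1.5, 1.0, 1.0 - eps, 1.0 + eps))   # gain is capped at the upper bound
print(clipped_surrogate(1.5, -1.0, 1.0 - eps, 1.0 + eps))  # penalty is not clipped
```

An adaptive scheme in the spirit of the abstract would pass different `low` and `high` values per state-action pair rather than a single fixed ε.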