Arrow Research

Author name cluster

Honghao Wei

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

14 papers
2 author rows

Possible papers (14)

AAAI Conference 2026 Conference Paper

Safe Reinforcement Learning for Trustworthy AI: Theory, Algorithms, and Applications

  • Honghao Wei

Safe reinforcement learning (RL) has emerged as a key paradigm for deploying AI in high-stakes domains such as autonomous driving, robotics, healthcare, and recommender systems. By embedding constraints into the learning process, safe RL enables agents to optimize performance while satisfying critical requirements, including collision avoidance, resource limits, and system reliability. Such guarantees are indispensable for real-world AI, where failures can cause physical harm, economic loss, or loss of trust. At the same time, demand for trustworthy AI continues to grow as machine learning is increasingly deployed in human-centered applications. This makes it essential to design RL algorithms that are not only efficient but also reliable, robust, and aligned with societal needs.

TMLR Journal 2026 Journal Article

Towards Fast Safe Online Reinforcement Learning via Policy Finetuning

  • Keru Chen
  • Honghao Wei
  • Zhigang Deng
  • Sen Lin

High costs and risks involved in extensive environmental interactions hinder the practical application of current online safe reinforcement learning (RL) methods. Inspired by recent successes in offline-to-online (O2O) RL, it is crucial to explore whether offline safe RL can be leveraged to facilitate faster and safer online learning, a direction that has yet to be fully investigated. To fill this gap, we first show that naively applying existing O2O algorithms from standard RL would not work well in safe RL due to two unique challenges: \emph{erroneous Q-estimations}, resulting from the offline-online objective mismatch and offline cost sparsity, and \emph{Lagrangian mismatch}, resulting from difficulties in aligning Lagrange multipliers between offline and online policies. To address these challenges, we introduce \textbf{Marvel}, the first policy-finetuning based framework for O2O safe RL, comprising two key components that work in concert: \emph{Value Pre-Alignment} to align the learned Q-functions with the online objective before finetuning, and \emph{Adaptive PID Control} to effectively adjust the Lagrange multipliers during finetuning. Extensive experiments demonstrate the superior performance of Marvel over related baselines.
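
Of the two components, the multiplier controller is easy to picture. The sketch below is a generic PID-controlled Lagrange multiplier of the kind "Adaptive PID Control" refers to in Lagrangian safe RL; the gains, the cost limit, and the update interface are illustrative assumptions, not Marvel's implementation.

```python
# Hypothetical sketch: a PID-controlled Lagrange multiplier for a cost
# constraint J_c(pi) <= cost_limit. All gains and interfaces are assumed.

class PIDLagrangeMultiplier:
    def __init__(self, cost_limit, kp=0.05, ki=0.01, kd=0.01):
        self.cost_limit = cost_limit        # constraint threshold
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0                 # accumulated violation (anti-windup at 0)
        self.prev_error = 0.0
        self.lam = 0.0                      # Lagrange multiplier, kept >= 0

    def update(self, episode_cost):
        """Adjust lambda from the latest measured episode cost."""
        error = episode_cost - self.cost_limit            # > 0 means the policy is unsafe
        self.integral = max(0.0, self.integral + error)
        derivative = error - self.prev_error
        self.prev_error = error
        self.lam = max(0.0, self.kp * error + self.ki * self.integral + self.kd * derivative)
        return self.lam


if __name__ == "__main__":
    pid = PIDLagrangeMultiplier(cost_limit=25.0)
    for cost in [40.0, 35.0, 28.0, 24.0, 22.0]:           # hypothetical episode costs
        lam = pid.update(cost)
        print(f"episode cost {cost:5.1f} -> lambda {lam:.3f}")
        # The penalized objective during finetuning would then be reward - lam * cost.
```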

ICML Conference 2025 Conference Paper

An Optimistic Algorithm for Online CMDPs with Anytime Adversarial Constraints

  • Jiahui Zhu
  • Kihyun Yu
  • Dabeen Lee
  • Xin Liu 0049
  • Honghao Wei

Online safe reinforcement learning (RL) plays a key role in dynamic environments, with applications in autonomous driving, robotics, and cybersecurity. The objective is to learn optimal policies that maximize rewards while satisfying safety constraints modeled by constrained Markov decision processes (CMDPs). Existing methods achieve sublinear regret under stochastic constraints but often fail in adversarial settings, where constraints are unknown, time-varying, and potentially adversarially designed. In this paper, we propose the Optimistic Mirror Descent Primal-Dual (OMDPD) algorithm, the first to address online CMDPs with anytime adversarial constraints. OMDPD achieves optimal regret $\tilde{\mathcal{O}}(\sqrt{K})$ and strong constraint violation $\tilde{\mathcal{O}}(\sqrt{K})$ without relying on Slater’s condition or the existence of a strictly known safe policy. We also show that access to accurate estimates of rewards and transitions can further improve these bounds. Our results offer practical guarantees for safe decision-making in adversarial environments.
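
As a rough picture of a primal-dual mirror-descent update, the toy below applies an exponentiated-gradient (KL mirror map) step to a single-state policy and a projected gradient-ascent step to the dual variable; the known rewards and costs, the step sizes, and the single-state setting are assumptions for exposition and do not reproduce OMDPD's occupancy-measure formulation or its optimistic estimates.

```python
# Toy single-state CMDP: maximize expected reward s.t. expected cost <= budget,
# via exponentiated-gradient (mirror descent) on the policy and projected
# gradient ascent on the Lagrange multiplier. All values below are made up.

import numpy as np

rewards = np.array([1.0, 0.6, 0.2])   # per-action reward
costs   = np.array([0.9, 0.4, 0.1])   # per-action cost
budget  = 0.5                         # expected-cost constraint

policy = np.ones(3) / 3               # primal variable: distribution over actions
lam = 0.0                             # dual variable for the constraint
eta_pi, eta_lam = 0.1, 0.1            # step sizes (assumed)

for _ in range(2000):
    grad = rewards - lam * costs                   # policy gradient of the Lagrangian
    policy = policy * np.exp(eta_pi * grad)        # mirror-descent (multiplicative) step
    policy /= policy.sum()
    violation = policy @ costs - budget
    lam = max(0.0, lam + eta_lam * violation)      # dual ascent, projected to lam >= 0

print("policy:", np.round(policy, 3), "expected cost:", round(float(policy @ costs), 3))
```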

AAAI Conference 2025 Conference Paper

Constraint-Adaptive Policy Switching for Offline Safe Reinforcement Learning

  • Yassine Chemingui
  • Aryan Deshwal
  • Honghao Wei
  • Alan Fern
  • Jana Doppa

Offline safe reinforcement learning (OSRL) involves learning a decision-making policy from a fixed batch of training data that maximizes rewards while satisfying pre-defined safety constraints. However, adapting to varying safety constraints during deployment without retraining remains an under-explored challenge. To address this challenge, we introduce constraint-adaptive policy switching (CAPS), a wrapper framework around existing offline RL algorithms. During training, CAPS uses offline data to learn multiple policies with a shared representation that optimize different reward and cost trade-offs. During testing, CAPS switches between those policies by selecting at each state the policy that maximizes future rewards among those that satisfy the current cost constraint. Our experiments on 38 tasks from the DSRL benchmark demonstrate that CAPS consistently outperforms existing methods, establishing a strong wrapper-based baseline for OSRL.
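
The switching rule itself is simple enough to sketch. The snippet below is a minimal, hypothetical rendering of the selection step described in the abstract; the `CandidatePolicy` interface, the value estimates, and the fallback when no candidate meets the budget are assumptions, not the paper's implementation.

```python
# Hypothetical sketch of a CAPS-style switching rule: among policies whose
# estimated future cost at the current state fits the deployment-time budget,
# act with the one whose estimated future reward is highest.

from dataclasses import dataclass
from typing import Any, Callable, Sequence


@dataclass
class CandidatePolicy:
    act: Callable[[Any], Any]              # state -> action
    reward_to_go: Callable[[Any], float]   # estimated future reward from state
    cost_to_go: Callable[[Any], float]     # estimated future cost from state


def caps_select_action(state: Any, policies: Sequence[CandidatePolicy], cost_budget: float) -> Any:
    safe = [p for p in policies if p.cost_to_go(state) <= cost_budget]
    if safe:
        chosen = max(safe, key=lambda p: p.reward_to_go(state))
    else:
        # Fallback (an assumption here): if nothing meets the budget,
        # use the policy with the smallest estimated future cost.
        chosen = min(policies, key=lambda p: p.cost_to_go(state))
    return chosen.act(state)


if __name__ == "__main__":
    risky = CandidatePolicy(lambda s: "fast", lambda s: 10.0, lambda s: 8.0)
    cautious = CandidatePolicy(lambda s: "slow", lambda s: 6.0, lambda s: 2.0)
    print(caps_select_action(None, [risky, cautious], cost_budget=5.0))   # -> slow
```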

NeurIPS Conference 2025 Conference Paper

Enhancing Safety in Reinforcement Learning with Human Feedback via Rectified Policy Optimization

  • Xiyue Peng
  • Hengquan Guo
  • Jiawei Zhang
  • Dongqing Zou
  • Ziyu Shao
  • Honghao Wei
  • Xin Liu

Balancing helpfulness and safety (harmlessness) is a critical challenge in aligning large language models (LLMs). Current approaches often decouple these two objectives, training separate preference models for helpfulness and safety, while framing safety as a constraint within a constrained Markov Decision Process (CMDP) framework. This paper identifies a potential issue when using the widely adopted expected safety constraints for LLM safety alignment, termed ``safety compensation'', where the constraints are satisfied in expectation, but individual prompts may trade off safety, resulting in some responses being overly restrictive while others remain unsafe. To address this issue, we propose Rectified Policy Optimization (RePO), which replaces the expected safety constraint with critical safety constraints imposed on every prompt. At the core of RePO is a policy update mechanism driven by rectified policy gradients, which penalizes the strict safety violation of every prompt, thereby enhancing safety across nearly all prompts. Our experiments demonstrate that RePO outperforms strong baseline methods and significantly enhances LLM safety alignment.
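
The per-prompt rectification can be illustrated numerically. The snippet below contrasts an expected (averaged) safety constraint with a rectified per-prompt penalty; the cost values and the threshold are invented for illustration, and the function is not RePO's actual objective.

```python
# Illustration of "safety compensation": with an expected constraint, a very
# safe prompt can offset an unsafe one; a rectified per-prompt penalty cannot
# be offset, because only positive violations are charged.

import numpy as np

def rectified_safety_penalty(safety_cost, threshold):
    """Mean of per-prompt violations max(0, cost - threshold)."""
    return float(np.maximum(safety_cost - threshold, 0.0).mean())

safety_cost = np.array([-2.0, 1.5])   # hypothetical per-prompt costs; > 0 means unsafe
threshold = 0.0

expected_violation = max(0.0, float(safety_cost.mean() - threshold))    # 0.0: looks "safe" on average
rectified_violation = rectified_safety_penalty(safety_cost, threshold)  # 0.75: flags the unsafe prompt

print(expected_violation, rectified_violation)
```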

AAAI Conference 2025 Conference Paper

HGSFusion: Radar-Camera Fusion with Hybrid Generation and Synchronization for 3D Object Detection

  • Zijian Gu
  • Jianwei Ma
  • Yan Huang
  • Honghao Wei
  • Zhanye Chen
  • Hui Zhang
  • Wei Hong

Millimeter-wave radar plays a vital role in 3D object detection for autonomous driving due to its all-weather and all-lighting-condition capabilities for perception. However, radar point clouds suffer from pronounced sparsity and unavoidable angle estimation errors. Incorporating a camera may help partially mitigate these shortcomings. Nevertheless, the direct fusion of radar and camera data can lead to negative or even opposite effects due to the lack of depth information in images and low-quality image features under adverse lighting conditions. Hence, in this paper, we present the radar-camera fusion network with Hybrid Generation and Synchronization (HGSFusion), designed to better fuse radar potentials and image features for 3D object detection. Specifically, we propose the Radar Hybrid Generation Module (RHGM), which fully considers the Direction-Of-Arrival (DOA) estimation errors in radar signal processing. This module generates denser radar points through different Probability Density Functions (PDFs) with the assistance of semantic information. Meanwhile, we introduce the Dual Sync Module (DSM), comprising spatial sync and modality sync, to enhance image features with radar positional information and facilitate the fusion of distinct characteristics in different modalities. Extensive experiments demonstrate the effectiveness of our approach, outperforming the state-of-the-art methods on the VoD and TJ4DRadSet datasets by 6.53% and 2.03% in RoI AP and BEV AP, respectively.

NeurIPS Conference 2024 Conference Paper

Adversarially Trained Weighted Actor-Critic for Safe Offline Reinforcement Learning

  • Honghao Wei
  • Xiyue Peng
  • Arnob Ghosh
  • Xin Liu

We propose WSAC (Weighted Safe Actor-Critic), a novel algorithm for Safe Offline Reinforcement Learning (RL) under function approximation, which can robustly optimize policies to improve upon an arbitrary reference policy with limited data coverage. WSAC is designed as a two-player Stackelberg game to optimize a refined objective function. The actor optimizes the policy against two adversarially trained value critics with small importance-weighted Bellman errors, which focus on scenarios where the actor's performance is inferior to the reference policy. In theory, we demonstrate that when the actor employs a no-regret optimization oracle, WSAC achieves a number of guarantees: $(i)$ For the first time in the safe offline RL setting, we establish that WSAC can produce a policy that outperforms {\bf any} reference policy while maintaining the same level of safety, which is critical to designing a safe algorithm for offline RL. $(ii)$ WSAC achieves the optimal statistical convergence rate of $1/\sqrt{N}$ to the reference policy, where $N$ is the size of the offline dataset. $(iii)$ We theoretically show that WSAC guarantees a safe policy improvement across a broad range of hyperparameters that control the degree of pessimism, indicating its practical robustness. Additionally, we offer a practical version of WSAC and compare it with existing state-of-the-art safe offline RL algorithms in several continuous control environments. WSAC outperforms all baselines across a range of tasks, supporting the theoretical results.
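
One ingredient named in the abstract, the importance-weighted Bellman error, is easy to sketch in isolation. The snippet below is a simplified stand-in: the squared Bellman residual of a critic on offline transitions, weighted by importance ratios of the learned policy to the behavior policy (assumed to be estimated elsewhere). It omits the adversarial training and the Stackelberg structure entirely.

```python
# Simplified sketch of an importance-weighted Bellman error for an offline
# critic; the weights, values, and rewards below are random placeholders.

import numpy as np

def weighted_bellman_error(q_sa, rewards, q_next, weights, gamma=0.99):
    """
    q_sa    : critic values Q(s, a) on offline transitions, shape (n,)
    rewards : observed rewards, shape (n,)
    q_next  : bootstrapped next-state values under the actor, shape (n,)
    weights : importance ratios pi(a|s) / mu(a|s), shape (n,)
    """
    residual = q_sa - (rewards + gamma * q_next)
    return float(np.mean(weights * residual ** 2))

rng = np.random.default_rng(0)
n = 8
print(weighted_bellman_error(rng.normal(size=n), rng.normal(size=n),
                             rng.normal(size=n), rng.uniform(0.5, 2.0, size=n)))
```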

RLJ Journal 2024 Journal Article

Reinforcement Learning from Human Feedback without Reward Inference: Model-Free Algorithm and Instance-Dependent Analysis

  • Qining Zhang
  • Honghao Wei
  • Lei Ying

In this paper, we study reinforcement learning from human feedback (RLHF) under an episodic Markov decision process with a general trajectory-wise reward model. We develop a model-free RLHF best policy identification algorithm, called $\mathsf{BSAD}$, without explicit reward model inference, which is a critical intermediate step in the contemporary RLHF paradigms for training large language models (LLMs). The algorithm identifies the optimal policy directly from human preference information in a backward manner, employing a dueling bandit sub-routine that constantly duels actions to identify the superior one. $\mathsf{BSAD}$ adopts reward-free exploration and a best-arm-identification-like adaptive stopping criterion to equalize the visitation among all states in the same decision step while moving to the previous step as soon as the optimal action is identifiable, leading to a provable, instance-dependent sample complexity $\tilde{\mathcal{O}}(c_{\mathcal{M}}SA^3H^3M\log\frac{1}{\delta})$ which resembles the result in classic RL, where $c_{\mathcal{M}}$ is the instance-dependent constant and $M$ is the batch size. Moreover, $\mathsf{BSAD}$ can be transformed into an explore-then-commit algorithm with logarithmic regret and generalized to discounted MDPs using a frame-based approach. Our results show: (i) sample-complexity-wise, RLHF is not significantly harder than classic RL and (ii) end-to-end RLHF may deliver improved performance by avoiding pitfalls in reward inference such as overfitting and distribution shift.
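
The dueling sub-routine can be pictured with a toy identification loop: candidates are compared pairwise through a noisy preference oracle, and the current champion is replaced whenever a challenger is preferred more often. The oracle, sample counts, and elimination rule below are illustrative assumptions and not $\mathsf{BSAD}$ itself, which additionally works backward over decision steps with adaptive stopping.

```python
# Toy best-action identification from pairwise preferences only (no reward
# inference). Latent values and the preference model are assumptions.

import random

def duel(a, b, latent_value):
    """Noisy preference oracle: returns True if a is preferred over b."""
    p_a_wins = 0.5 + 0.5 * (latent_value[a] - latent_value[b])   # values in [0, 1]
    return random.random() < p_a_wins

def best_action_by_dueling(actions, latent_value, duels_per_pair=200):
    champion = actions[0]
    for challenger in actions[1:]:
        wins = sum(duel(challenger, champion, latent_value) for _ in range(duels_per_pair))
        if wins > duels_per_pair / 2:      # challenger preferred more often than not
            champion = challenger
    return champion

latent_value = {"a0": 0.2, "a1": 0.7, "a2": 0.5}   # hypothetical, never observed directly
print(best_action_by_dueling(list(latent_value), latent_value))   # typically prints "a1"
```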

RLC Conference 2024 Conference Paper

Reinforcement Learning from Human Feedback without Reward Inference: Model-Free Algorithm and Instance-Dependent Analysis

  • Qining Zhang
  • Honghao Wei
  • Lei Ying

In this paper, we study reinforcement learning from human feedback (RLHF) under an episodic Markov decision process with a general trajectory-wise reward model. We develop a model-free RLHF best policy identification algorithm, called $\mathsf{BSAD}$, without explicit reward model inference, which is a critical intermediate step in the contemporary RLHF paradigms for training large language models (LLMs). The algorithm identifies the optimal policy directly from human preference information in a backward manner, employing a dueling bandit sub-routine that constantly duels actions to identify the superior one. $\mathsf{BSAD}$ adopts reward-free exploration and a best-arm-identification-like adaptive stopping criterion to equalize the visitation among all states in the same decision step while moving to the previous step as soon as the optimal action is identifiable, leading to a provable, instance-dependent sample complexity $\tilde{\mathcal{O}}(c_{\mathcal{M}}SA^3H^3M\log\frac{1}{\delta})$ which resembles the result in classic RL, where $c_{\mathcal{M}}$ is the instance-dependent constant and $M$ is the batch size. Moreover, $\mathsf{BSAD}$ can be transformed into an explore-then-commit algorithm with logarithmic regret and generalized to discounted MDPs using a frame-based approach. Our results show: (i) sample-complexity-wise, RLHF is not significantly harder than classic RL and (ii) end-to-end RLHF may deliver improved performance by avoiding pitfalls in reward inference such as overfitting and distribution shift.

NeurIPS Conference 2024 Conference Paper

Safe and Efficient: A Primal-Dual Method for Offline Convex CMDPs under Partial Data Coverage

  • Haobo Zhang
  • Xiyue Peng
  • Honghao Wei
  • Xin Liu

Offline safe reinforcement learning (RL) aims to find an optimal policy using a pre-collected dataset when data collection is impractical or risky. We propose a novel linear programming (LP) based primal-dual algorithm for convex MDPs that incorporates ``uncertainty'' parameters to improve data efficiency while requiring only a partial data coverage assumption. Our theoretical results achieve a sample complexity of $\mathcal{O}(1/((1-\gamma)\sqrt{n}))$ under general function approximation, improving the current state-of-the-art by a factor of $1/(1-\gamma)$, where $n$ is the number of data samples in an offline dataset, and $\gamma$ is the discount factor. The numerical experiments validate our theoretical findings, demonstrating the practical efficacy of our approach in achieving improved safety and learning efficiency in safe offline settings.

AAAI Conference 2024 Conference Paper

Safe Reinforcement Learning with Instantaneous Constraints: The Role of Aggressive Exploration

  • Honghao Wei
  • Xin Liu
  • Lei Ying

This paper studies safe Reinforcement Learning (safe RL) with linear function approximation and under hard instantaneous constraints where unsafe actions must be avoided at each step. Existing studies have considered safe RL with hard instantaneous constraints, but their approaches rely on several key assumptions: (i) the RL agent knows a safe action set for every state or knows a safe graph in which all the state-action-state triples are safe, and (ii) the constraint/cost functions are linear. In this paper, we consider safe RL with instantaneous hard constraints without assumption (i) and generalize (ii) to Reproducing Kernel Hilbert Space (RKHS). Our proposed algorithm, LSVI-AE, achieves $\mathcal{O}(\sqrt{d^3H^4K})$ regret and $\mathcal{O}(H\sqrt{dK})$ hard constraint violation when the cost function is linear and $\mathcal{O}(H\gamma_K\sqrt{K})$ hard constraint violation when the cost function belongs to RKHS. Here $K$ is the learning horizon, $H$ is the length of each episode, and $\gamma_K$ is the information gain w.r.t. the kernel used to approximate cost functions. Our results achieve the optimal dependency on the learning horizon $K$, matching the lower bound we provide in this paper and demonstrating the efficiency of LSVI-AE. Notably, the design of our approach encourages aggressive policy exploration, providing a unique perspective on safe RL with general cost functions and no prior knowledge of safe actions, which may be of independent interest.

NeurIPS Conference 2023 Conference Paper

Sample Efficient Reinforcement Learning in Mixed Systems through Augmented Samples and Its Applications to Queueing Networks

  • Honghao Wei
  • Xin Liu
  • Weina Wang
  • Lei Ying

This paper considers a class of reinforcement learning problems, which involve systems with two types of states: stochastic and pseudo-stochastic. In such systems, stochastic states follow a stochastic transition kernel while the transitions of pseudo-stochastic states are deterministic {\em given} the stochastic states/transitions. We refer to such systems as mixed systems, which are widely used in various applications, including manufacturing systems, communication networks, and queueing networks. We propose a sample-efficient RL method that accelerates learning by generating augmented data samples. The proposed algorithm is data-driven (model-free), but it learns the policy from both real and augmented samples. This method significantly improves learning by reducing the sample complexity such that the dataset only needs to have sufficient coverage of the stochastic states. We analyze the sample complexity of the proposed method under Fitted Q Iteration (FQI) and demonstrate that the optimality gap decreases as $O\left(\sqrt{\frac{1}{n}}+\sqrt{\frac{1}{m}}\right)$, where $n$ represents the number of real samples, and $m$ is the number of augmented samples per real sample. It is important to note that without augmented samples, the optimality gap is $O(1)$ due to the insufficient data coverage of the pseudo-stochastic states. Our experimental results on multiple queueing network applications confirm that the proposed method indeed significantly accelerates both deep Q-learning and deep policy gradient.
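
The augmentation idea is concrete enough for a small sketch. Below, a queue length plays the role of a pseudo-stochastic state whose update is deterministic once the stochastic arrival is known, so one observed arrival can be replayed from many queue lengths to create extra samples; the queue dynamics, reward, and interfaces are assumptions for illustration, not the paper's algorithm.

```python
# Illustrative augmentation for a "mixed" system: replay one observed
# stochastic outcome (the arrival) from several pseudo-stochastic states
# (queue lengths), whose transition is deterministic given that outcome.

import random

def queue_next(queue_len, served, arrival):
    # Deterministic queue update given the stochastic arrival.
    return max(0, queue_len - served) + arrival

def augment(observed, candidate_queue_lens, reward_fn):
    """observed = (queue_len, served, arrival) from one real interaction."""
    _, served, arrival = observed
    samples = []
    for q in candidate_queue_lens:                 # replay the same arrival elsewhere
        samples.append((q, served, reward_fn(q, served), queue_next(q, served, arrival)))
    return samples

holding_cost = lambda q, served: -q                # e.g. negative holding cost as reward
real = (5, 2, random.randint(0, 3))                # one real sample with an observed arrival
print(augment(real, candidate_queue_lens=[0, 2, 5, 10], reward_fn=holding_cost))
```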

AAAI Conference 2022 Conference Paper

A Provably-Efficient Model-Free Algorithm for Infinite-Horizon Average-Reward Constrained Markov Decision Processes

  • Honghao Wei
  • Xin Liu
  • Lei Ying

This paper presents a model-free reinforcement learning (RL) algorithm for infinite-horizon average-reward Constrained Markov Decision Processes (CMDPs). For a sufficiently large learning horizon $K$, the proposed algorithm achieves $\tilde{\mathcal{O}}\left(\frac{\sqrt{SA\kappa}}{\delta}K^{\frac{5}{6}}\right)$ regret and zero constraint violation, where $S$ is the number of states, $A$ is the number of actions, and $\kappa$ and $\delta$ are two constants independent of the learning horizon $K$.

NeurIPS Conference 2022 Conference Paper

Online Convex Optimization with Hard Constraints: Towards the Best of Two Worlds and Beyond

  • Hengquan Guo
  • Xin Liu
  • Honghao Wei
  • Lei Ying

This paper considers online convex optimization with hard constraints and analyzes achievable regret and cumulative hard constraint violation (violation for short). The problem distinguishes itself from online convex optimization with soft constraints, where a violation at one round can be compensated/cancelled by a conservative decision at a different round. We propose a RECtified Online Optimization algorithm (RECOO) and consider two settings: fixed constraints and adversarial constraints. Both settings have been considered in the literature. Compared with existing results, {\em RECOO achieves the best of two worlds and beyond.} For the fixed-constraints setting, RECOO achieves $O\left(\sqrt{T}\right)$ regret and $O(1)$ violation, where $T$ is the learning horizon. The best known results in this case are $O(\sqrt{T})$ regret and $O\left(T^{1/4}\right)$ violation. For the adversarial-constraints setting, it guarantees $O(\sqrt{T})$ regret and $O(T^{3/4})$ violation, which match the best existing results. When the loss functions are strongly convex, RECOO can guarantee $O(\log T)$ regret and $O(1)$ violation for fixed constraints, and $O(\log T)$ regret and $O(\sqrt{T\log T})$ violation for adversarial constraints. Both results are order-wise better than the existing bounds. The regret and violation bounds mentioned above use the best fixed decision in hindsight as the baseline. This paper further considers a dynamic baseline where the comparator sequence is time-varying, and shows that RECOO not only improves the existing results in the fixed-constraints setting but also, {\em for the first time,} guarantees dynamic regret and violation bounds in the adversarial-constraints setting. Our experimental results confirm that RECOO outperforms several existing algorithms for both fixed and adversarial constraints.
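
The soft-versus-hard distinction drawn in the opening sentences can be made concrete with a two-line computation; the per-round constraint values below are invented purely for illustration.

```python
# Soft (compensable) vs. hard (cumulative) constraint violation:
# hard violation sums only the positive parts, so over-satisfying the
# constraint at one round cannot cancel a violation at another.

g_values = [0.5, -0.5, 0.3, -0.4]                      # g_t(x_t) > 0 means round t violates

soft_violation = max(0.0, sum(g_values))               # 0.0: violations cancelled out
hard_violation = sum(max(0.0, g) for g in g_values)    # 0.8: violations accumulate

print(soft_violation, hard_violation)
```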