Arrow Research Search

Author name cluster

Qining Zhang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

7 papers
2 author rows

Possible papers


ICLR Conference 2025 · Conference Paper

Zeroth-Order Policy Gradient for Reinforcement Learning from Human Feedback without Reward Inference

  • Qining Zhang
  • Lei Ying 0001

Reward inference (learning a reward model from human preferences) is a critical intermediate step in the Reinforcement Learning from Human Feedback (RLHF) pipeline for fine-tuning Large Language Models (LLMs). In practice, RLHF faces fundamental challenges such as distribution shift, reward model overfitting, and problem misspecification. An alternative approach is direct policy optimization without reward inference, such as Direct Preference Optimization (DPO), which provides a much simpler pipeline and has shown empirical success in LLM applications. However, DPO utilizes the closed-form expression between the optimal policy and the reward function, which is only suitable under the bandit setting or in deterministic MDPs. This paper develops two RLHF algorithms without reward inference for general RL problems beyond bandits and deterministic MDPs, and for general preference models beyond the Bradley-Terry model. The key idea is to estimate the local value function difference from human preferences and then approximate the policy gradient with a zeroth-order gradient approximator. For both algorithms, we establish polynomial convergence rates in terms of the number of policy gradient iterations, the number of trajectory samples, and the number of human preference queries per iteration. Numerical experiments in stochastic environments validate the performance of the proposed algorithms, which outperform popular RLHF baselines such as DPO and PPO. Our paper shows that there exist provably efficient methods for solving general RLHF problems without reward inference.
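
To make the zeroth-order idea concrete, here is a minimal sketch of one preference-driven policy-gradient step. The callables `rollout(theta)` and `query_preference(traj_a, traj_b)` are hypothetical stand-ins for the environment and the human rater, not the paper's actual interface, and all constants are arbitrary; only the two-point perturbation and the preference-based value-difference estimate follow the abstract's description.

```python
# Minimal sketch of one preference-driven zeroth-order policy-gradient
# step. `rollout` and `query_preference` are hypothetical stand-ins for
# the environment and the human rater; constants are arbitrary.
import numpy as np

def zo_pg_step(theta, rollout, query_preference, mu=0.1, lr=0.01, m=32):
    """Estimate the local value difference V(theta + mu*u) - V(theta - mu*u)
    from m pairwise preferences, then step along the random direction u."""
    d = theta.shape[0]
    u = np.random.randn(d)
    u /= np.linalg.norm(u)                    # random direction on the unit sphere
    wins = 0
    for _ in range(m):
        traj_plus = rollout(theta + mu * u)   # trajectory under perturbed policy
        traj_minus = rollout(theta - mu * u)
        wins += query_preference(traj_plus, traj_minus)  # 1 if traj_plus preferred
    # The empirical preference rate serves as a (scaled) proxy for the
    # local value-function difference under the preference model.
    delta_hat = 2.0 * wins / m - 1.0
    grad_hat = (d / (2.0 * mu)) * delta_hat * u          # two-point ZO estimator
    return theta + lr * grad_hat
```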

RLC Conference 2024 · Conference Paper

Cost Aware Best Arm Identification

  • Kellen Kanarios
  • Qining Zhang
  • Lei Ying

In this paper, we study a best arm identification problem with dual objectives. In addition to the classic reward, each arm is associated with a cost distribution, and the goal is to identify the largest-reward arm at the minimum expected cost. We call this Cost Aware Best Arm Identification (CABAI), which captures the separation of testing and implementation phases in product development pipelines and models the objective shift between phases, i.e., cost for testing and reward for implementation. We first derive a theoretical lower bound for CABAI and propose an algorithm called $\mathsf{CTAS}$ to match it asymptotically. To reduce the computational cost of $\mathsf{CTAS}$, we further propose a low-complexity algorithm called CO, based on a square-root rule, which is provably optimal in simplified two-armed models and generalizes surprisingly well in numerical experiments. Our results show that (i) ignoring heterogeneous action costs results in sub-optimality in practice, and (ii) low-complexity algorithms can deliver near-optimal performance over a wide range of problems.
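
As a rough illustration of what a cost-aware sampling loop looks like, the sketch below shrinks an arm's exploration bonus by the square root of its empirical per-pull cost. This index is an assumption for illustration only, not the paper's $\mathsf{CTAS}$ or CO rule, and `sample_reward` / `sample_cost` are hypothetical per-arm oracles.

```python
# Hedged illustration of a cost-aware sampling loop for best arm
# identification. The index below, which shrinks the exploration bonus
# of expensive arms by the square root of their empirical per-pull cost,
# is an assumption for illustration only, not the CTAS or CO rule.
import numpy as np

def cost_aware_bai(sample_reward, sample_cost, K, horizon=10_000):
    n = np.zeros(K)            # pulls per arm
    r = np.zeros(K)            # cumulative reward per arm
    c = np.full(K, 1e-3)       # cumulative cost per arm (small prior)
    for t in range(horizon):
        if t < K:
            i = t              # pull each arm once to initialize
        else:
            mean = r / n
            cost = c / n
            idx = mean + np.sqrt(np.log(t + 1) / n) / np.sqrt(cost)
            i = int(np.argmax(idx))
        n[i] += 1
        r[i] += sample_reward(i)
        c[i] += sample_cost(i)
    return int(np.argmax(r / n))   # recommend the empirically best arm
```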

RLJ Journal 2024 · Journal Article

Cost Aware Best Arm Identification

  • Kellen Kanarios
  • Qining Zhang
  • Lei Ying

In this paper, we study a best arm identification problem with dual objectives. In addition to the classic reward, each arm is associated with a cost distribution, and the goal is to identify the largest-reward arm at the minimum expected cost. We call this Cost Aware Best Arm Identification (CABAI), which captures the separation of testing and implementation phases in product development pipelines and models the objective shift between phases, i.e., cost for testing and reward for implementation. We first derive a theoretical lower bound for CABAI and propose an algorithm called $\mathsf{CTAS}$ to match it asymptotically. To reduce the computational cost of $\mathsf{CTAS}$, we further propose a low-complexity algorithm called CO, based on a square-root rule, which is provably optimal in simplified two-armed models and generalizes surprisingly well in numerical experiments. Our results show that (i) ignoring heterogeneous action costs results in sub-optimality in practice, and (ii) low-complexity algorithms can deliver near-optimal performance over a wide range of problems.

AAAI Conference 2024 · Conference Paper

Deep Reinforcement Learning for Early Diagnosis of Lung Cancer

  • Yifan Wang
  • Qining Zhang
  • Lei Ying
  • Chuan Zhou

Lung cancer remains the leading cause of cancer-related death worldwide, and early diagnosis of lung cancer is critical for improving patients' survival rate. Performing annual low-dose computed tomography (LDCT) screening among high-risk populations is the primary approach for early diagnosis. However, after each screening, whether to continue monitoring (with follow-up screenings) or to order a biopsy for diagnosis remains a challenging decision. Continuing with follow-up screenings may delay diagnosis, but ordering a biopsy without sufficient evidence incurs unnecessary risk and cost. In this paper, we tackle the problem with an optimal stopping approach. Our proposed algorithm, called EarlyStop-RL, utilizes the structure of the Snell envelope for optimal stopping and model-free deep reinforcement learning for making diagnosis decisions. By evaluating our algorithm on a commonly used clinical trial dataset (the National Lung Screening Trial), we demonstrate that EarlyStop-RL has the potential to greatly enhance risk assessment and early diagnosis of lung cancer, surpassing the performance of two widely adopted clinical models, namely Lung-RADS and the Brock model.
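
The Snell envelope the abstract leans on satisfies the backward recursion $U_H = g_H$ and $U_t(s) = \max\{g_t(s), \mathbb{E}[U_{t+1} \mid s]\}$. Below is a minimal sketch of that recursion on a toy Markov chain with a known transition matrix; EarlyStop-RL itself learns the stop/continue decision model-free with deep RL, so this shows only the underlying optimal-stopping structure, not the paper's algorithm.

```python
# Minimal sketch of the Snell envelope recursion on a toy Markov chain
# with a known transition matrix P. EarlyStop-RL learns the decision
# model-free with deep RL; this only shows the underlying structure.
import numpy as np

def snell_envelope(P, g):
    """P: (S, S) transition matrix; g: (H+1, S) stopping payoffs.
    Returns the envelope U and the stop/continue decision per (t, s)."""
    H = g.shape[0] - 1
    U = np.zeros_like(g, dtype=float)
    stop = np.zeros_like(g, dtype=bool)
    U[H] = g[H]
    stop[H] = True                     # must stop at the horizon
    for t in range(H - 1, -1, -1):
        cont = P @ U[t + 1]            # expected value of continuing
        U[t] = np.maximum(g[t], cont)
        stop[t] = g[t] >= cont         # stop now (e.g., order a biopsy)
    return U, stop
```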

RLJ Journal 2024 · Journal Article

Reinforcement Learning from Human Feedback without Reward Inference: Model-Free Algorithm and Instance-Dependent Analysis

  • Qining Zhang
  • Honghao Wei
  • Lei Ying

In this paper, we study reinforcement learning from human feedback (RLHF) under an episodic Markov decision process with a general trajectory-wise reward model. We develop a model-free RLHF best policy identification algorithm, called $\mathsf{BSAD}$, without explicit reward model inference, which is a critical intermediate step in contemporary RLHF paradigms for training large language models (LLMs). The algorithm identifies the optimal policy directly from human preference information in a backward manner, employing a dueling bandit sub-routine that constantly duels actions to identify the superior one. $\mathsf{BSAD}$ adopts reward-free exploration and a best-arm-identification-like adaptive stopping criterion to equalize the visitation among all states in the same decision step while moving to the previous step as soon as the optimal action is identifiable, leading to a provable, instance-dependent sample complexity $\tilde{\mathcal{O}}(c_{\mathcal{M}}SA^3H^3M\log\frac{1}{\delta})$ which resembles the result in classic RL, where $c_{\mathcal{M}}$ is the instance-dependent constant and $M$ is the batch size. Moreover, $\mathsf{BSAD}$ can be transformed into an explore-then-commit algorithm with logarithmic regret and generalized to discounted MDPs using a frame-based approach. Our results show that (i) sample-complexity-wise, RLHF is not significantly harder than classic RL, and (ii) end-to-end RLHF may deliver improved performance by avoiding pitfalls of reward inference such as overfitting and distribution shift.
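
A minimal sketch of the dueling flavor described above: compare two actions until a Hoeffding-style confidence bound separates the empirical preference rate from 1/2, then keep the winner. The `duel(a, b)` oracle (returning 1 when `a` is preferred) and the specific bound are assumptions for illustration, not $\mathsf{BSAD}$'s actual sub-routine or stopping rule.

```python
# Sketch of a dueling sub-routine with an adaptive stopping criterion:
# duel two actions until a Hoeffding-style bound separates the empirical
# preference rate from 1/2. `duel(a, b)` is an assumed preference oracle.
import math

def duel_until_separated(duel, a, b, delta=0.05, max_n=100_000):
    wins = 0
    for n in range(1, max_n + 1):
        wins += duel(a, b)
        p_hat = wins / n
        radius = math.sqrt(math.log(2 * n * (n + 1) / delta) / (2 * n))
        if abs(p_hat - 0.5) > radius:      # confidently separated from a tie
            return a if p_hat > 0.5 else b
    return a if wins / max_n > 0.5 else b  # fall back to the empirical winner

def identify_best(actions, duel):
    """Keep a running champion and duel every challenger against it."""
    champion = actions[0]
    for challenger in actions[1:]:
        champion = duel_until_separated(duel, challenger, champion)
    return champion
```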

RLC Conference 2024 · Conference Paper

Reinforcement Learning from Human Feedback without Reward Inference: Model-Free Algorithm and Instance-Dependent Analysis

  • Qining Zhang
  • Honghao Wei
  • Lei Ying

In this paper, we study reinforcement learning from human feedback (RLHF) under an episodic Markov decision process with a general trajectory-wise reward model. We develop a model-free RLHF best policy identification algorithm, called $\mathsf{BSAD}$, without explicit reward model inference, which is a critical intermediate step in contemporary RLHF paradigms for training large language models (LLMs). The algorithm identifies the optimal policy directly from human preference information in a backward manner, employing a dueling bandit sub-routine that constantly duels actions to identify the superior one. $\mathsf{BSAD}$ adopts reward-free exploration and a best-arm-identification-like adaptive stopping criterion to equalize the visitation among all states in the same decision step while moving to the previous step as soon as the optimal action is identifiable, leading to a provable, instance-dependent sample complexity $\tilde{\mathcal{O}}(c_{\mathcal{M}}SA^3H^3M\log\frac{1}{\delta})$ which resembles the result in classic RL, where $c_{\mathcal{M}}$ is the instance-dependent constant and $M$ is the batch size. Moreover, $\mathsf{BSAD}$ can be transformed into an explore-then-commit algorithm with logarithmic regret and generalized to discounted MDPs using a frame-based approach. Our results show that (i) sample-complexity-wise, RLHF is not significantly harder than classic RL, and (ii) end-to-end RLHF may deliver improved performance by avoiding pitfalls of reward inference such as overfitting and distribution shift.

NeurIPS Conference 2023 · Conference Paper

Fast and Regret Optimal Best Arm Identification: Fundamental Limits and Low-Complexity Algorithms

  • Qining Zhang
  • Lei Ying

This paper considers a stochastic Multi-Armed Bandit (MAB) problem with dual objectives: (i) quick identification and commitment to the optimal arm, and (ii) reward maximization throughout a sequence of $T$ consecutive rounds. Though each objective has been individually well-studied, i.e., best arm identification for (i) and regret minimization for (ii), the simultaneous realization of both objectives remains an open problem, despite its practical importance. This paper introduces \emph{Regret Optimal Best Arm Identification} (ROBAI), which aims to achieve these dual objectives. To solve ROBAI under both pre-determined and adaptive stopping time requirements, we present an algorithm called EOCP and its variants, respectively, which not only achieve asymptotically optimal regret in both Gaussian and general bandits, but also commit to the optimal arm in $\mathcal{O}(\log T)$ rounds with pre-determined stopping time and $\mathcal{O}(\log^2 T)$ rounds with adaptive stopping time. We further characterize lower bounds on the commitment time (equivalent to the sample complexity) of ROBAI, showing that EOCP and its variants are sample-optimal with pre-determined stopping time, and almost sample-optimal with adaptive stopping time. Numerical results confirm our theoretical analysis and reveal an interesting "over-exploration" phenomenon exhibited by classic UCB algorithms: EOCP has smaller regret even though it stops exploration much earlier than UCB, i.e., $\mathcal{O}(\log T)$ versus $\mathcal{O}(T)$, suggesting that over-exploration is unnecessary and potentially harmful to system performance.
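
To illustrate the explore-then-commit shape behind EOCP, here is a hedged sketch: explore for a pre-determined $\mathcal{O}(\log T)$ window, then commit to the empirical best arm for the rest of the horizon. The round-robin schedule and the constant `c` are illustrative assumptions; the paper's algorithm uses a more refined exploration and stopping design.

```python
# Hedged sketch of the explore-then-commit shape: explore for a
# pre-determined O(log T) window, then commit to the empirical best arm.
# The round-robin schedule and constant c are illustrative assumptions.
import numpy as np

def explore_then_commit(pull, K, T, c=50):
    n = np.zeros(K)
    r = np.zeros(K)
    explore_len = min(T, int(c * np.log(T)))
    total = 0.0
    for t in range(explore_len):
        i = t % K                          # round-robin exploration
        x = pull(i)
        n[i] += 1
        r[i] += x
        total += x
    best = int(np.argmax(r / np.maximum(n, 1)))
    for _ in range(T - explore_len):       # committed phase: no switching
        total += pull(best)
    return best, total
```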