Author name cluster

Zhengling Qi

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

11 papers

2 author rows

NeurIPS Conference 2025 Conference Paper

A Principled Path to Fitted Distributional Evaluation

Sungee Hong
Jiayi Wang
Zhengling Qi
Raymond K. W. Wong

In reinforcement learning, distributional off-policy evaluation (OPE) focuses on estimating the return distribution of a target policy using offline data collected under a different policy. This work focuses on extending the widely used fitted Q-evaluation (developed for expectation-based reinforcement learning) to the distributional OPE setting. We refer to this extension as fitted distributional evaluation (FDE). While only a few related approaches exist, there remains no unified framework for designing FDE methods. To fill this gap, we present a set of guiding principles for constructing theoretically grounded FDE methods. Building on these principles, we develop several new FDE methods with convergence analysis and provide theoretical justification for existing methods, even in non-tabular environments. Extensive experiments, including simulations on linear quadratic regulators and Atari games, demonstrate the superior performance of the FDE methods.

NeurIPS Conference 2025 Conference Paper

Breaking the Order Barrier: Off-Policy Evaluation for Confounded POMDPs

Qi Kuang
Jiayi Wang
Fan Zhou
Zhengling Qi

We consider off-policy evaluation (OPE) in Partially Observable Markov Decision Processes (POMDPs) with unobserved confounding. Recent advances have introduced bridge functions to circumvent unmeasured confounding and develop estimators for the policy value, yet how their statistical error bounds depend on the length of horizon $T$ and the size of the state-action space $|\mathcal{O}||\mathcal{A}|$ remains largely unexplored. In this paper, we systematically investigate the finite-sample error bounds of OPE estimators in finite-horizon tabular confounded POMDPs. Specifically, we show that under certain rank conditions, the estimation error for policy value can achieve a rate of $\mathcal{O}(T^{1.5}/\sqrt{n})$, excluding the cardinality of the observation space $|\mathcal{O}|$ and the action space $|\mathcal{A}|$. With an additional mild condition on the concentrability coefficients in confounded POMDPs, the rate of estimation error can be improved to $\mathcal{O}(T/\sqrt{n})$. We also show that for a fully history-dependent policy, the estimation error scales as $\mathcal{O}\big(T/\sqrt{n}(|\mathcal{O}| |\mathcal{A}|)^{\frac{T}{2}}\big)$, highlighting the exponential error dependence introduced by history-based proxies to infer hidden states. Furthermore, when the target policy is a memoryless policy, the error bound improves to $\mathcal{O}\big(T/\sqrt{n}\sqrt{|\mathcal{O}| |\mathcal{A}|}\big)$, which matches the optimal rate known for tabular MDPs. To the best of our knowledge, this is the first work to provide a comprehensive finite-sample analysis of OPE in confounded POMDPs.

ICML Conference 2024 Conference Paper

A Fine-grained Analysis of Fitted Q-evaluation: Beyond Parametric Models

Jiayi Wang
Zhengling Qi
Raymond K. W. Wong

In this paper, we delve into the statistical analysis of the fitted Q-evaluation (FQE) method, which focuses on estimating the value of a target policy using offline data generated by some behavior policy. We provide a comprehensive theoretical understanding of FQE estimators under both parametric and non-parametric models on the Q-function. Specifically, we address three key questions related to FQE that remain largely unexplored in the current literature: (1) Is the optimal convergence rate for estimating the policy value regarding the sample size $n$ ($n^{-1/2}$) achievable for FQE under a nonparametric model with a fixed horizon ($T$)? (2) How does the error bound depend on the horizon $T$? (3) What is the role of the probability ratio function in improving the convergence of FQE estimators? Specifically, we show that under the completeness assumption of Q-functions, which is mild in the non-parametric setting, the estimation errors for policy value using both parametric and non-parametric FQE estimators can achieve an optimal rate in terms of $n$. The corresponding error bounds in terms of both $n$ and $T$ are also established. With an additional realizability assumption on ratio functions, the rate of estimation errors can be improved from $T^{1.5}/\sqrt{n}$ to $T/\sqrt{n}$, which matches the sharpest known bound in the current literature under the tabular setting.

ICLR Conference 2024 Conference Paper

A Policy Gradient Method for Confounded POMDPs

Mao Hong
Zhengling Qi
Yanxun Xu

In this paper, we propose a policy gradient method for confounded partially observable Markov decision processes (POMDPs) with continuous state and observation spaces in the offline setting. We first establish a novel identification result to non-parametrically estimate any history-dependent policy gradient under POMDPs using the offline data. The identification enables us to solve a sequence of conditional moment restrictions and adopt the min-max learning procedure with general function approximation for estimating the policy gradient. We then provide a finite-sample non-asymptotic bound for estimating the gradient uniformly over a pre-specified policy class in terms of the sample size, length of horizon, concentrability coefficient and the measure of ill-posedness in solving the conditional moment restrictions. Lastly, by deploying the proposed gradient estimation in the gradient ascent algorithm, we show the global convergence of the proposed algorithm in finding the history-dependent optimal policy under some technical conditions. To the best of our knowledge, this is the first work studying the policy gradient method for POMDPs under the offline setting.

ICML Conference 2024 Conference Paper

Model-based Reinforcement Learning for Confounded POMDPs

Mao Hong
Zhengling Qi
Yanxun Xu

We propose a model-based offline reinforcement learning (RL) algorithm for confounded partially observable Markov decision processes (POMDPs) under general function approximations and show it is provably efficient under some technical conditions such as the partial coverage imposed on the offline data distribution. Specifically, we first establish a novel model-based identification result for learning the effect of any action on the reward and future transitions in the confounded POMDP. Using this identification result, we then design a nonparametric two-stage estimation procedure to construct an estimator for off-policy evaluation (OPE), which permits general function approximations. Finally, we learn the optimal policy by performing a conservative policy optimization within the confidence regions based on the proposed estimation procedure for OPE. Under some mild conditions, we establish a finite-sample upper bound on the suboptimality of the learned policy in finding the optimal one, which depends on the sample size and the length of horizons polynomially.

NeurIPS Conference 2024 Conference Paper

Two-way Deconfounder for Off-policy Evaluation in Causal Reinforcement Learning

Shuguang Yu
Shuxing Fang
Ruixin Peng
Zhengling Qi
Fan Zhou
Chengchun Shi

This paper studies off-policy evaluation (OPE) in the presence of unmeasured confounders. Inspired by the two-way fixed effects regression model widely used in the panel data literature, we propose a two-way unmeasured confounding assumption to model the system dynamics in causal reinforcement learning and develop a two-way deconfounder algorithm that devises a neural tensor network to simultaneously learn both the unmeasured confounders and the system dynamics, based on which a model-based estimator can be constructed for consistent policy value estimation. We illustrate the effectiveness of the proposed estimator through theoretical results and numerical experiments.

ICML Conference 2023 Conference Paper

PASTA: Pessimistic Assortment Optimization

Juncheng Dong
Weibin Mo
Zhengling Qi
Cong Shi 0001
Ethan X. Fang
Vahid Tarokh

We consider a fundamental class of assortment optimization problems in an offline data-driven setting. The firm does not know the underlying customer choice model but has access to an offline dataset consisting of the historically offered assortment set, customer choice, and revenue. The objective is to use the offline dataset to find an optimal assortment. Due to the combinatorial nature of assortment optimization, the problem of insufficient data coverage is likely to occur in the offline dataset. Therefore, designing a provably efficient offline learning algorithm becomes a significant challenge. To this end, based on the principle of pessimism, we propose a novel algorithm called Pessimistic ASsortment opTimizAtion (PASTA for short), which can correctly identify the optimal assortment by only requiring the offline data to cover the optimal assortment under general settings. In particular, we establish the first regret bound for the offline assortment optimization problem under the celebrated multinomial logit model (MNL). We also propose an efficient computational procedure to solve our pessimistic assortment optimization problem. Our numerical studies demonstrate the superiority of the proposed method over the existing baseline method.

UAI Conference 2023 Conference Paper

Pessimistic Model Selection for Offline Deep Reinforcement Learning

Chao-Han Huck Yang
Zhengling Qi
Yifan Cui 0001
Pin-Yu Chen

Deep Reinforcement Learning (DRL) has demonstrated great potential in solving sequential decision-making problems in many applications. Despite its promising performance, practical gaps exist when deploying DRL in real-world scenarios. One main barrier is the over-fitting issue that leads to poor generalizability of the policy learned by DRL. In particular, for offline DRL with observational data, model selection is a challenging task as there is no ground truth available for performance demonstration, in contrast with the online setting with simulated environments. In this work, we propose a pessimistic model selection (PMS) approach for offline DRL with a theoretical guarantee, which features a provably effective framework for finding the best policy among a set of candidate models. Two refined approaches are also proposed to address the potential bias of the DRL model in identifying the optimal policy. Numerical studies demonstrated the superior performance of our approach over existing methods.

NeurIPS Conference 2022 Conference Paper

Off-Policy Evaluation for Episodic Partially Observable Markov Decision Processes under Non-Parametric Models

Rui Miao
Zhengling Qi
Xiaoke Zhang

We study the problem of off-policy evaluation (OPE) for episodic Partially Observable Markov Decision Processes (POMDPs) with continuous states. Motivated by the recently proposed proximal causal inference framework, we develop a non-parametric identification result for estimating the policy value via a sequence of so-called V-bridge functions with the help of time-dependent proxy variables. We then develop a fitted-Q-evaluation-type algorithm to estimate V-bridge functions recursively, where a non-parametric instrumental variable (NPIV) problem is solved at each step. By analyzing this challenging sequential NPIV estimation, we establish the finite-sample error bounds for estimating the V-bridge functions and accordingly that for evaluating the policy value, in terms of the sample size, length of horizon and so-called (local) measure of ill-posedness at each step. To the best of our knowledge, this is the first finite-sample error bound for OPE in POMDPs under non-parametric models.

ICML Conference 2022 Conference Paper

On Well-posedness and Minimax Optimal Rates of Nonparametric Q-function Estimation in Off-policy Evaluation

Xiaohong Chen
Zhengling Qi

We study the off-policy evaluation (OPE) problem in an infinite-horizon Markov decision process with continuous states and actions. We recast the $Q$-function estimation into a special form of the nonparametric instrumental variables (NPIV) estimation problem. We first show that under one mild condition the NPIV formulation of $Q$-function estimation is well-posed in the sense of $L^2$-measure of ill-posedness with respect to the data generating distribution, bypassing a strong assumption on the discount factor $\gamma$ imposed in the recent literature for obtaining the $L^2$ convergence rates of various $Q$-function estimators. Thanks to this new well-posed property, we derive the first minimax lower bounds for the convergence rates of nonparametric estimation of $Q$-function and its derivatives in both sup-norm and $L^2$-norm, which are shown to be the same as those for the classical nonparametric regression (Stone, 1982). We then propose a sieve two-stage least squares estimator and establish its rate-optimality in both norms under some mild conditions. Our general results on the well-posedness and the minimax lower bounds are of independent interest to study not only other nonparametric estimators for $Q$-function but also efficient estimation on the value of any target policy in off-policy settings.

NeurIPS Conference 2022 Conference Paper

RISE: Robust Individualized Decision Learning with Sensitive Variables

Xiaoqing Tan
Zhengling Qi
Christopher Seymour
Lu Tang

This paper introduces RISE, a robust individualized decision learning framework with sensitive variables, where sensitive variables are collectible data and important to the intervention decision, but their inclusion in decision making is prohibited due to reasons such as delayed availability or fairness concerns. A naive baseline is to ignore these sensitive variables in learning decision rules, leading to significant uncertainty and bias. To address this, we propose a decision learning framework to incorporate sensitive variables during offline training but not include them in the input of the learned decision rule during model deployment. Specifically, from a causal perspective, the proposed framework intends to improve the worst-case outcomes of individuals caused by sensitive variables that are unavailable at the time of decision. Unlike most existing literature that uses mean-optimal objectives, we propose a robust learning framework by finding a newly defined quantile- or infimum-optimal decision rule. The reliable performance of the proposed method is demonstrated through synthetic experiments and three real-world applications.
