Arrow Research

Author name cluster

Lihong Li

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

29 papers
1 author row

Possible papers

29

NeurIPS Conference 2025 Conference Paper

Ask a Strong LLM Judge when Your Reward Model is Uncertain

  • Zhenghao Xu
  • Qin Lu
  • Qingru Zhang
  • Liang Qiu
  • Ilgee Hong
  • Changlong Yu
  • Wenlin Yao
  • Yao Liu

The reward model (RM) plays a pivotal role in reinforcement learning from human feedback (RLHF) for aligning large language models (LLMs). However, classical RMs trained on human preferences are vulnerable to reward hacking and generalize poorly to out-of-distribution (OOD) inputs. By contrast, strong LLM judges equipped with reasoning capabilities demonstrate superior generalization, even without additional training, but incur significantly higher inference costs, limiting their applicability in online RLHF. In this work, we propose an uncertainty-based routing framework that efficiently complements a fast RM with a strong but costly LLM judge. Our approach formulates advantage estimation in policy gradient (PG) methods as pairwise preference classification, enabling principled uncertainty quantification to guide routing. Uncertain pairs are forwarded to the LLM judge, while confident ones are evaluated by the RM. Experiments on RM benchmarks demonstrate that our uncertainty-based routing strategy significantly outperforms random judge calling at the same cost, and downstream alignment results showcase its effectiveness in improving online RLHF.
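A minimal sketch of the routing idea in this abstract, assuming Bradley-Terry-style RM scores; the names reward_model and llm_judge and the threshold tau are illustrative placeholders, not the paper's API, and closeness of the preference probability to 0.5 stands in for the paper's uncertainty measure:

```python
# Hypothetical routing between a fast reward model and a costly LLM judge.
# All names are placeholders; the real method's uncertainty quantification
# may differ from the |p - 0.5| proxy used here.
import math

def preference_prob(score_a: float, score_b: float) -> float:
    """Bradley-Terry probability that response A is preferred over B."""
    return 1.0 / (1.0 + math.exp(score_b - score_a))

def route_pair(prompt, resp_a, resp_b, reward_model, llm_judge, tau=0.1):
    """Use the RM when it is confident; forward uncertain pairs to the judge."""
    p = preference_prob(reward_model(prompt, resp_a), reward_model(prompt, resp_b))
    if abs(p - 0.5) < tau:                 # uncertain: pay for the strong judge
        return llm_judge(prompt, resp_a, resp_b)
    return p > 0.5                         # confident: cheap RM verdict
```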

NeurIPS Conference 2020 Conference Paper

CoinDICE: Off-Policy Confidence Interval Estimation

  • Bo Dai
  • Ofir Nachum
  • Yinlam Chow
  • Lihong Li
  • Csaba Szepesvari
  • Dale Schuurmans

We study high-confidence behavior-agnostic off-policy evaluation in reinforcement learning, where the goal is to estimate a confidence interval on a target policy's value, given only access to a static experience dataset collected by unknown behavior policies. Starting from a function space embedding of the linear program formulation of the Q-function, we obtain an optimization problem with generalized estimating equation constraints. By applying the generalized empirical likelihood method to the resulting Lagrangian, we propose CoinDICE, a novel and efficient algorithm for computing confidence intervals. Theoretically, we prove the obtained confidence intervals are valid, in both asymptotic and finite-sample regimes. Empirically, we show in a variety of benchmarks that the confidence interval estimates are tighter and more accurate than existing methods.

NeurIPS Conference 2020 Conference Paper

Escaping the Gravitational Pull of Softmax

  • Jincheng Mei
  • Chenjun Xiao
  • Bo Dai
  • Lihong Li
  • Csaba Szepesvari
  • Dale Schuurmans

The softmax is the standard transformation used in machine learning to map real-valued vectors to categorical distributions. Unfortunately, this transform poses serious drawbacks for gradient descent (ascent) optimization. We reveal this difficulty by establishing two negative results: (1) optimizing any expectation with respect to the softmax must exhibit sensitivity to parameter initialization ("softmax gravity well"), and (2) optimizing log-probabilities under the softmax must exhibit slow convergence ("softmax damping"). Both findings are based on an analysis of convergence rates using the Non-uniform Łojasiewicz (NŁ) inequalities. To circumvent these shortcomings we investigate an alternative transformation, the escort mapping, that demonstrates better optimization properties. The disadvantages of the softmax and the effectiveness of the escort transformation are further explained using the concept of the NŁ coefficient. In addition to proving bounds on convergence rates to firmly establish these results, we also provide experimental evidence for the superiority of the escort transformation.
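For concreteness, a small sketch of the two transformations, taking the escort mapping as $f_i(\theta) = |\theta_i|^p / \sum_j |\theta_j|^p$ with $p = 2$; this illustrates the mappings themselves, not the paper's optimization analysis:

```python
# Softmax vs. escort transform on the same parameter vector.
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())        # shift for numerical stability
    return z / z.sum()

def escort(theta, p=2.0):
    z = np.abs(theta) ** p                 # |theta_i|^p / sum_j |theta_j|^p
    return z / z.sum()

theta = np.array([3.0, 0.5, -1.0])
print(softmax(theta))   # sharply concentrated on the largest entry
print(escort(theta))    # same ordering, heavier tails
```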

NeurIPS Conference 2020 Conference Paper

Off-Policy Evaluation via the Regularized Lagrangian

  • Mengjiao Yang
  • Ofir Nachum
  • Bo Dai
  • Lihong Li
  • Dale Schuurmans

The recently proposed distribution correction estimation (DICE) family of estimators has advanced the state of the art in off-policy evaluation from behavior-agnostic data. While these estimators all perform some form of stationary distribution correction, they arise from different derivations and objective functions. In this paper, we unify these estimators as regularized Lagrangians of the same linear program. The unification allows us to expand the space of DICE estimators to new alternatives that demonstrate improved performance. More importantly, by analyzing the expanded space of estimators both mathematically and empirically we find that dual solutions offer greater flexibility in navigating the tradeoff between optimization stability and estimation bias, and generally provide superior estimates in practice.

NeurIPS Conference 2019 Conference Paper

A Kernel Loss for Solving the Bellman Equation

  • Yihao Feng
  • Lihong Li
  • Qiang Liu

Value function learning plays a central role in many state-of-the-art reinforcement learning algorithms. Many popular algorithms like Q-learning do not optimize any objective function, but are fixed-point iterations of variants of the Bellman operator that are not necessarily a contraction. As a result, they may easily lose convergence guarantees, as can be observed in practice. In this paper, we propose a novel loss function, which can be optimized using standard gradient-based methods with guaranteed convergence. The key advantage is that its gradient can be easily approximated using sampled transitions, avoiding the need for double samples required by prior algorithms like residual gradient. Our approach may be combined with general function classes such as neural networks, using either on- or off-policy data, and is shown to work reliably and effectively in several benchmarks, including classic problems where standard algorithms are known to diverge.
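A rough sketch of what a kernelized Bellman loss over a batch of transitions can look like, assuming a Gaussian kernel over states and a value network V; an illustration under those assumptions, not the paper's exact estimator:

```python
# U-statistic-style estimate of E[delta(x) K(x, x') delta(x')]: products of
# Bellman residuals at distinct transitions, weighted by a kernel.
import torch

def kernel_bellman_loss(V, s, r, s_next, gamma=0.99):
    delta = r + gamma * V(s_next).squeeze(-1) - V(s).squeeze(-1)  # residuals
    K = torch.exp(-torch.cdist(s, s).pow(2))     # RBF kernel between states
    n = s.shape[0]
    off_diag = 1.0 - torch.eye(n)                # drop i == j terms
    return (delta[:, None] * K * delta[None, :] * off_diag).sum() / (n * (n - 1))
```

Because each term multiplies residuals from two distinct transitions, the gradient can be estimated from sampled data without the double samples mentioned above.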

NeurIPS Conference 2019 Conference Paper

DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections

  • Ofir Nachum
  • Yinlam Chow
  • Bo Dai
  • Lihong Li

In many real-world reinforcement learning applications, access to the environment is limited to a fixed dataset, instead of direct (online) interaction with the environment. When using this data for either evaluation or training of a new policy, accurate estimates of discounted stationary distribution ratios -- correction terms which quantify the likelihood that the new policy will experience a certain state-action pair normalized by the probability with which the state-action pair appears in the dataset -- can improve accuracy and performance. In this work, we propose an algorithm, DualDICE, for estimating these quantities. In contrast to previous approaches, our algorithm is agnostic to knowledge of the behavior policy (or policies) used to generate the dataset. Furthermore, our algorithm eschews any direct use of importance weights, thus avoiding potential optimization instabilities endemic to previous methods. In addition to providing theoretical guarantees, we present an empirical study of our algorithm applied to off-policy policy evaluation and find that our algorithm significantly improves accuracy compared to existing techniques.

JBHI Journal 2019 Journal Article

Regression Convolutional Neural Network for Automated Pediatric Bone Age Assessment From Hand Radiograph

  • Xuhua Ren
  • Tingting Li
  • Xiujun Yang
  • Shuai Wang
  • Sahar Ahmad
  • Lei Xiang
  • Shaun Richard Stone
  • Lihong Li

Skeletal bone age assessment is a common clinical practice to investigate endocrinology, and genetic and growth disorders of children. However, clinical interpretation and bone age analyses are time-consuming, labor intensive, and often subject to inter-observer variability. This motivates the need for a fully automated method for bone age assessment. We propose a regression convolutional neural network (CNN) to automatically assess pediatric bone age from hand radiographs. Our network is specifically trained to pay more attention to the bone-age-related regions in the X-ray images. Specifically, we first adopt the attention module to process all images and generate the coarse/fine attention maps as inputs for the regression network. Then, the regression CNN follows the supervision of the dynamic attention loss during training; thus, it can estimate the bone age of the hard (or “outlier”) images more accurately. The experimental results show that our method achieves an average discrepancy of 5.2–5.3 months between clinical and automatic bone age evaluations on two large datasets. In conclusion, we propose a fully automated deep learning solution to process X-ray images of the hand for bone age assessment, with accuracy comparable to human experts but with much better efficiency.

NeurIPS Conference 2018 Conference Paper

Adversarial Attacks on Stochastic Bandits

  • Kwang-Sung Jun
  • Lihong Li
  • Yuzhe Ma
  • Jerry Zhu

We study adversarial attacks that manipulate the reward signals to control the actions chosen by a stochastic multi-armed bandit algorithm. We propose the first attack against two popular bandit algorithms: $\epsilon$-greedy and UCB, without knowledge of the mean rewards. The attacker is able to spend only logarithmic effort, multiplied by a problem-specific parameter that becomes smaller as the bandit problem gets easier to attack. The result means the attacker can easily hijack the behavior of the bandit algorithm to promote or obstruct certain actions, say, a particular medical treatment. As bandits are seeing increasingly wide use in practice, our study exposes a significant security threat.
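An illustrative sketch (not the paper's exact attack) of the reward-manipulation idea against $\epsilon$-greedy: after each pull of a non-target arm, subtract just enough reward that the arm's empirical mean stays below the target's:

```python
# Reward-poisoning sketch: keep every non-target arm looking worse than the
# target arm. `emp_mean` and `counts` are the attacker's running statistics,
# which the caller updates with the corrupted rewards; `margin` is illustrative.
def corrupt(reward, arm, target, emp_mean, counts, margin=0.1):
    """Return the (possibly corrupted) reward and the attack cost paid."""
    if arm == target:
        return reward, 0.0                   # never touch the target arm
    n = counts[arm] + 1
    new_mean = (emp_mean[arm] * counts[arm] + reward) / n
    ceiling = emp_mean[target] - margin      # where the arm's mean must sit
    if new_mean <= ceiling:
        return reward, 0.0                   # already looks bad enough
    attack = (new_mean - ceiling) * n        # pull the mean down to the ceiling
    return reward - attack, attack
```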

AAAI Conference 2018 Conference Paper

BBQ-Networks: Efficient Exploration in Deep Reinforcement Learning for Task-Oriented Dialogue Systems

  • Zachary Lipton
  • Xiujun Li
  • Jianfeng Gao
  • Lihong Li
  • Faisal Ahmed
  • Li Deng

We present a new algorithm that significantly improves the efficiency of exploration for deep Q-learning agents in dialogue systems. Our agents explore via Thompson sampling, drawing Monte Carlo samples from a Bayes-by-Backprop neural network. Our algorithm learns much faster than common exploration strategies such as $\epsilon$-greedy, Boltzmann, bootstrapping, and intrinsic-reward-based ones. Additionally, we show that spiking the replay buffer with experiences from just a few successful episodes can make Q-learning feasible when it might otherwise fail.
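A hedged sketch of the Thompson-sampling step for a Bayes-by-Backprop network, shrunk here to a linear Q head so it stays self-contained; mu/rho parameterize the factorized Gaussian posterior, and all names are illustrative:

```python
# One Thompson draw: sample weights from the posterior, act greedily under them.
import torch

def thompson_action(mu_w, rho_w, mu_b, rho_b, state):
    """w ~ N(mu, sigma^2) with sigma = softplus(rho), then argmax over actions."""
    w = mu_w + torch.nn.functional.softplus(rho_w) * torch.randn_like(mu_w)
    b = mu_b + torch.nn.functional.softplus(rho_b) * torch.randn_like(mu_b)
    q = state @ w + b        # Q-values for each action under the sampled weights
    return int(torch.argmax(q))
```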

NeurIPS Conference 2018 Conference Paper

Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation

  • Qiang Liu
  • Lihong Li
  • Ziyang Tang
  • Dengyong Zhou

We consider the off-policy estimation problem of estimating the expected reward of a target policy using samples collected by a different behavior policy. Importance sampling (IS) has been a key technique to derive (nearly) unbiased estimators, but is known to suffer from an excessively high variance in long-horizon problems. In the extreme case of infinite-horizon problems, the variance of an IS-based estimator may even be unbounded. In this paper, we propose a new off-policy estimation method that applies IS directly on the stationary state-visitation distributions to avoid the exploding variance issue faced by existing estimators. Our key contribution is a novel approach to estimating the density ratio of two stationary distributions, with trajectories sampled from only the behavior distribution. We develop a mini-max loss function for the estimation problem, and derive a closed-form solution for the case of an RKHS. We support our method with both theoretical and empirical analyses.
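Once the density ratio $w(s) = d_\pi(s)/d_b(s)$ has been estimated, the resulting estimator is a weighted average of rewards; a hedged, self-normalized sketch under that assumption:

```python
# Stationary-distribution-corrected off-policy estimate of the reward.
import numpy as np

def value_estimate(w_s, pi_probs, b_probs, rewards):
    """v(pi) ~= sum_i w(s_i) * pi(a_i|s_i)/b(a_i|s_i) * r_i, self-normalized."""
    weights = w_s * (pi_probs / b_probs)   # state correction x action correction
    return np.sum(weights * rewards) / np.sum(weights)
```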

RLDM Conference 2017 Conference Abstract

Deep Reinforcement Learning for Conversational Systems

  • Lihong Li

Deep reinforcement learning (RL) has seen tremendous successes in solving video and board games, by leveraging the great representational power of deep-learning models within the RL framework. More applications are emerging in robotics as well as natural language processing. In this talk, we demonstrate how deep RL can be used to develop dialogue systems that can converse like humans and help users solve specific tasks. In particular, we report work on three related projects: end-to-end training with an external knowledge base, domain extension where efficient exploration is required, and composite (hierarchical) task-completion dialogues.

NeurIPS Conference 2017 Conference Paper

Q-LDA: Uncovering Latent Patterns in Text-based Sequential Decision Processes

  • Jianshu Chen
  • Chong Wang
  • Lin Xiao
  • Ji He
  • Lihong Li
  • Li Deng

In sequential decision making, it is often important and useful for end users to understand the underlying patterns or causes that lead to the corresponding decisions. However, typical deep reinforcement learning algorithms seldom provide such information due to their black-box nature. In this paper, we present a probabilistic model, Q-LDA, to uncover latent patterns in text-based sequential decision processes. The model can be understood as a variant of latent topic models that are tailored to maximize total rewards; we further draw an interesting connection between an approximate maximum-likelihood estimation of Q-LDA and the celebrated Q-learning algorithm. We demonstrate in the text-game domain that our proposed method not only provides a viable mechanism to uncover latent patterns in decision processes, but also obtains state-of-the-art rewards in these games.

NeurIPS Conference 2016 Conference Paper

Active Learning with Oracle Epiphany

  • Tzu-Kuo Huang
  • Lihong Li
  • Ara Vartanian
  • Saleema Amershi
  • Jerry Zhu

We present a theoretical analysis of active learning with more realistic interactions with human oracles. Previous empirical studies have shown oracles abstaining on difficult queries until accumulating enough information to make label decisions. We formalize this phenomenon with an “oracle epiphany model” and analyze active learning query complexity under such oracles for both the realizable and the agnostic cases. Our analysis shows that active learning is possible with oracle epiphany, but incurs an additional cost depending on when the epiphany happens. Our results suggest new, principled active learning approaches with realistic oracles.

JBHI Journal 2016 Journal Article

α-Information-Based Registration of Dynamic Scans for Magnetic Resonance Cystography

  • Hao Han
  • Qin Lin
  • Lihong Li
  • Chaijie Duan
  • Hongbing Lu
  • Haifang Li
  • Zengmin Yan
  • John Fitzgerald

To continue our effort on developing magnetic resonance (MR) cystography, we introduce a novel nonrigid 3-D registration method to compensate for bladder wall motion and deformation in dynamic MR scans, which are impaired by relatively low signal-to-noise ratio in each time frame. The registration method is built on the similarity measure of α-information, which has the potential of achieving higher registration accuracy than the commonly used mutual information (MI) measure for either monomodality or multimodality image registration. The α-information metric was also demonstrated to be superior to both the mean squares and the cross-correlation metrics in multimodality scenarios. The proposed α-registration method was applied to bladder motion compensation in real patient studies, and its effect on the automatic and accurate segmentation of the bladder wall was also evaluated. Compared with the prevailing MI-based image registration approach, the presented α-information-based registration was more effective at capturing the bladder wall motion and deformation, which ensured the success of the subsequent bladder wall segmentation toward the goal of evaluating the entire bladder wall for detection and diagnosis of abnormality.

JBHI Journal 2015 Journal Article

Fast and Adaptive Detection of Pulmonary Nodules in Thoracic CT Images Using a Hierarchical Vector Quantization Scheme

  • Hao Han
  • Lihong Li
  • Fangfang Han
  • Bowen Song
  • William Moore
  • Zhengrong Liang

Computer-aided detection (CADe) of pulmonary nodules is critical to assisting radiologists in early identification of lung cancer from computed tomography (CT) scans. This paper proposes a novel CADe system based on a hierarchical vector quantization (VQ) scheme. Compared with the commonly-used simple thresholding approach, the high-level VQ yields a more accurate segmentation of the lungs from the chest volume. In identifying initial nodule candidates (INCs) within the lungs, the low-level VQ proves to be effective for INC detection and segmentation, as well as computationally efficient compared to existing approaches. False-positive (FP) reduction is conducted via rule-based filtering operations in combination with a feature-based support vector machine classifier. The proposed system was validated on 205 patient cases from the publicly available online Lung Image Database Consortium database, with each case having at least one juxta-pleural nodule annotation. Experimental results demonstrated that our CADe system obtained an overall sensitivity of 82.7% at a specificity of 4 FPs/scan. For juxta-pleural nodules in particular, we observed 89.2% sensitivity at 4.14 FPs/scan. Compared with existing CADe systems, the proposed system performs favorably and demonstrates its potential for fast and adaptive detection of pulmonary nodules via CT imaging.

RLDM Conference 2015 Conference Abstract

The Online Discovery Problem and Its Application to Lifelong Reinforcement Learning

  • Emma Brunskill
  • Lihong Li

We study lifelong reinforcement learning, where the agent extracts knowledge from solving a sequence of tasks to speed learning in future ones. We first formulate and study a related online discovery problem, which can be of independent interest, and propose an optimal algorithm with matching upper and lower bounds. These results are then applied to create a robust, continuous lifelong reinforcement learning algorithm with formal learning guarantees, applicable to a much wider range of scenarios, as verified in simulations.

TIST Journal 2014 Journal Article

Exploiting User Preference for Online Learning in Web Content Optimization Systems

  • Jiang Bian
  • Bo Long
  • Lihong Li
  • Taesup Moon
  • Anlei Dong
  • Yi Chang

Web portal services have become an important medium to deliver digital content (e.g., news and advertisements) to Web users in a timely fashion. To attract more users to various content modules on the Web portal, it is necessary to design a recommender system that can effectively achieve Web portal content optimization by automatically estimating content item attractiveness and relevance to user interests. The state-of-the-art online learning methodology adapts dedicated pointwise models to independently estimate the attractiveness score for each candidate content item. Although such pointwise models can be easily adapted for online recommendation, a few critical problems remain. First, this pointwise methodology fails to use invaluable user preferences between content items. Moreover, the performance of pointwise models decreases drastically when facing the problem of sparse learning samples. To address these problems, we propose exploring a new dynamic pairwise learning methodology for Web portal content optimization, in which we exploit dynamic user preferences extracted from users' actions on portal services to compute the attractiveness scores of content items. In this article, we introduce two specific pairwise learning algorithms, a straightforward graph-based algorithm and a formalized Bayesian modeling one. Experiments on large-scale data from a commercial Web portal demonstrate the significant improvement of pairwise methodologies over the baseline pointwise models. Further analysis illustrates that our new pairwise learning approaches can benefit personalized recommendation more than pointwise models, since data sparsity is more critical for personalized content optimization.

RLDM Conference 2013 Conference Abstract

Sample Complexity of Multi-task Reinforcement Learning

  • Emma Brunskill
  • Lihong Li

A key aspect of human intelligence is our ability to leverage prior experience to improve our learning in future related tasks. Often these tasks themselves involve reinforcement learning, and an important goal in artificial intelligence is to create autonomous agents that perform better as they do a series of similar RL tasks. Though there is encouraging empirical evidence that leveraging past knowledge can improve agent performance in subsequent reinforcement learning tasks, there has been very little theoretical analysis. Towards addressing this gap, we introduce a new algorithm for acting in a sequence of reinforcement learning tasks when each task is sampled from (an unknown) distribution over a finite set of (unknown) Markov decision processes. In this setting, we prove under certain assumptions that the per-task sample complexity, the number of samples on which the agent may perform suboptimally, decreases significantly due to leveraging prior learned knowledge compared to standard single-task algorithms. Our multi-task RL algorithm also has the desired characteristic that it is guaranteed not to exhibit negative transfer: up to log factors, its per-task sample complexity is never worse than the corresponding single-task algorithm.

NeurIPS Conference 2011 Conference Paper

An Empirical Evaluation of Thompson Sampling

  • Olivier Chapelle
  • Lihong Li

Thompson sampling is one of the oldest heuristics to address the exploration/exploitation trade-off, but it is surprisingly not very popular in the literature. We present here some empirical results using Thompson sampling on simulated and real data, and show that it is highly competitive. Since this heuristic is very easy to implement, we argue that it should be part of the standard baselines to compare against.
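For reference, the classic Beta-Bernoulli instance of Thompson sampling, the kind of simple implementation the paper argues for; the priors and horizon here are arbitrary:

```python
# Thompson sampling on a K-armed Bernoulli bandit with Beta(1, 1) priors.
import numpy as np

rng = np.random.default_rng(0)
K, T = 5, 10_000
true_means = rng.uniform(size=K)
successes, failures = np.ones(K), np.ones(K)

for _ in range(T):
    theta = rng.beta(successes, failures)      # one posterior draw per arm
    arm = int(np.argmax(theta))                # pull the apparently best arm
    reward = rng.random() < true_means[arm]    # Bernoulli reward
    successes[arm] += reward
    failures[arm] += 1 - reward
```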

NeurIPS Conference 2010 Conference Paper

Learning from Logged Implicit Exploration Data

  • Alex Strehl
  • John Langford
  • Lihong Li
  • Sham Kakade

We provide a sound and consistent foundation for the use of nonrandom exploration data in “contextual bandit” or “partially labeled” settings where only the value of a chosen action is learned. The primary challenge in a variety of settings is that the exploration policy, with which “offline” data is logged, is not explicitly known. Prior solutions here require either control of the actions during the learning process, recorded random exploration, or actions chosen obliviously in a repeated manner. The techniques reported here lift these restrictions, allowing the learning of a policy for choosing actions given features from historical data where no randomization occurred or was logged. We empirically verify our solution on two reasonably sized sets of real-world data obtained from an Internet advertising company.
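A hedged sketch of the high-level recipe implied by this abstract: fit a model of the logger's action probabilities, then evaluate a new policy with clipped inverse-propensity weights. The clipping threshold tau is an assumption for illustration, not necessarily the paper's estimator:

```python
# Offline policy value estimate from non-randomized logs with estimated
# propensities; propensity_hat[i] approximates P(logged action | context i).
import numpy as np

def ips_value(policy, contexts, actions, rewards, propensity_hat, tau=0.05):
    p = np.maximum(propensity_hat, tau)        # guard against tiny propensities
    match = np.array([policy(x) == a for x, a in zip(contexts, actions)])
    return float(np.mean(match * rewards / p))
```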

NeurIPS Conference 2010 Conference Paper

Parallelized Stochastic Gradient Descent

  • Martin Zinkevich
  • Markus Weimer
  • Lihong Li
  • Alex Smola

With the increase in available data, parallel machine learning has become an increasingly pressing problem. In this paper we present the first parallel stochastic gradient descent algorithm, including a detailed analysis and experimental evidence. Unlike prior work on parallel optimization algorithms, our variant comes with parallel acceleration guarantees and it poses no overly tight latency constraints, which might only be available in the multicore setting. Our analysis introduces a novel proof technique, contractive mappings, to quantify the speed of convergence of parameter distributions to their asymptotic limits. As a side effect, this answers the question of how quickly stochastic gradient descent algorithms reach the asymptotically normal regime.
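A minimal sketch of the one-shot averaging scheme the abstract describes, simulated sequentially for clarity: each of k workers runs SGD over its own shard, and the resulting weight vectors are averaged once at the end (least-squares loss is an arbitrary choice here):

```python
# Parallelized SGD by parameter averaging over data shards.
import numpy as np

def sgd_shard(X, y, lr=0.01, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for i in rng.permutation(len(y)):          # one pass in random order
        w -= lr * (X[i] @ w - y[i]) * X[i]     # squared-loss gradient step
    return w

def parallel_sgd(X, y, k=4):
    shards = zip(np.array_split(X, k), np.array_split(y, k))
    ws = [sgd_shard(Xs, ys, seed=j) for j, (Xs, ys) in enumerate(shards)]
    return np.mean(ws, axis=0)                 # single averaging step at the end
```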

AAMAS Conference 2009 Conference Paper

Online Exploration in Least-Squares Policy Iteration

  • Lihong Li
  • Michael L. Littman
  • Christopher R. Mansley

One of the key problems in reinforcement learning is balancing exploration and exploitation. Another is learning and acting in large or even continuous Markov decision processes (MDPs), where compact function approximation has to be used. In this paper, we provide a practical solution to exploring large MDPs by integrating a powerful exploration technique, Rmax, into a state-of-the-art learning algorithm, least-squares policy iteration (LSPI). This approach combines the strengths of both methods, and has shown its effectiveness and superiority over LSPI with two other popular exploration rules in several benchmark problems.

JMLR Journal 2009 Journal Article

Provably Efficient Learning with Typed Parametric Models

  • Emma Brunskill
  • Bethany R. Leffler
  • Lihong Li
  • Michael L. Littman
  • Nicholas Roy

To quickly achieve good performance, reinforcement-learning algorithms for acting in large continuous-valued domains must use a representation that is both sufficiently powerful to capture important domain characteristics, and yet simultaneously allows generalization, or sharing, among experiences. Our algorithm balances this tradeoff by using a stochastic, switching, parametric dynamics representation. We argue that this model characterizes a number of significant, real-world domains, such as robot navigation across varying terrain. We prove that this representational assumption allows our algorithm to be probably approximately correct with a sample complexity that scales polynomially with all problem-specific quantities including the state-space dimension. We also explicitly incorporate the error introduced by approximate planning in our sample complexity bounds, in contrast to prior Probably Approximately Correct (PAC) Markov Decision Processes (MDP) approaches, which typically assume the estimated MDP can be solved exactly. Our experimental results on constructing plans for driving to work using real car trajectory data, as well as a small robot experiment on navigating varying terrain, demonstrate that our dynamics representation enables us to capture real-world dynamics in a sufficient manner to produce good performance.

JMLR Journal 2009 Journal Article

Reinforcement Learning in Finite MDPs: PAC Analysis

  • Alexander L. Strehl
  • Lihong Li
  • Michael L. Littman

We study the problem of learning near-optimal behavior in finite Markov Decision Processes (MDPs) with a polynomial number of samples. These "PAC-MDP" algorithms include the well-known E$^3$ and R-MAX algorithms as well as the more recent Delayed Q-learning algorithm. We summarize the current state of the art by presenting bounds for the problem in a unified theoretical framework. A more refined analysis for upper and lower bounds is presented to yield insight into the differences between the model-free Delayed Q-learning and the model-based R-MAX.
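The unifying quantity in this framework is the sample complexity of exploration; a hedged restatement of the standard definition, with $\mathcal{A}_t$ the algorithm's policy at step $t$:

```latex
N_{\epsilon} \;=\; \bigl|\bigl\{\, t \;:\; V^{\mathcal{A}_t}(s_t) \;<\; V^{*}(s_t) - \epsilon \,\bigr\}\bigr|
```

An algorithm is then called PAC-MDP when, with probability at least $1-\delta$, $N_{\epsilon}$ is bounded by a polynomial in $|S|$, $|A|$, $1/\epsilon$, $1/\delta$, and $1/(1-\gamma)$.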

JMLR Journal 2009 Journal Article

Sparse Online Learning via Truncated Gradient

  • John Langford
  • Lihong Li
  • Tong Zhang

We propose a general method called truncated gradient to induce sparsity in the weights of online-learning algorithms with convex loss functions. This method has several essential properties: The degree of sparsity is continuous---a parameter controls the rate of sparsification from no sparsification to total sparsification. The approach is theoretically motivated, and an instance of it can be regarded as an online counterpart of the popular $L_1$-regularization method in the batch setting. We prove that small rates of sparsification result in only small additional regret with respect to typical online-learning guarantees. The approach works well empirically. We apply the approach to several data sets and find that for data sets with large numbers of features, substantial sparsity is discoverable.
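A sketch of the truncated-gradient update under the stated idea: an ordinary online gradient step, followed every K steps by a shrink-toward-zero truncation of small weights; the hyperparameter values here are illustrative:

```python
# Truncated gradient: only coordinates inside [-theta, theta] are pulled
# toward zero, by eta * g * K, so large weights are left untouched.
import numpy as np

def truncate(v, alpha, theta):
    out = v.copy()
    small_pos = (v >= 0) & (v <= theta)
    small_neg = (v < 0) & (v >= -theta)
    out[small_pos] = np.maximum(v[small_pos] - alpha, 0.0)
    out[small_neg] = np.minimum(v[small_neg] + alpha, 0.0)
    return out

def tg_step(w, grad, t, eta=0.1, g=0.01, theta=1.0, K=10):
    w = w - eta * grad                   # standard online gradient step
    if t % K == 0:                       # sparsify every K steps
        w = truncate(w, eta * g * K, theta)
    return w
```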

NeurIPS Conference 2008 Conference Paper

Sparse Online Learning via Truncated Gradient

  • John Langford
  • Lihong Li
  • Tong Zhang

We propose a general method called truncated gradient to induce sparsity in the weights of online-learning algorithms with convex loss. This method has several essential properties. First, the degree of sparsity is continuous---a parameter controls the rate of sparsification from no sparsification to total sparsification. Second, the approach is theoretically motivated, and an instance of it can be regarded as an online counterpart of the popular $L_1$-regularization method in the batch setting. We prove that small rates of sparsification result in only small additional regret with respect to typical online-learning guarantees. Finally, the approach works well empirically. We apply it to several datasets and find that for datasets with large numbers of features, substantial sparsity is discoverable.

AAAI Conference 2005 Conference Paper

Lazy Approximation for Solving Continuous Finite-Horizon MDPs

  • Lihong Li

Solving Markov decision processes (MDPs) with continuous state spaces is a challenge due to, among other problems, the well-known curse of dimensionality. Nevertheless, numerous real-world applications such as transportation planning and telescope observation scheduling exhibit a critical dependence on continuous states. Current approaches to continuous-state MDPs include discretizing their transition models. In this paper, we propose and study an alternative, discretization-free approach we call lazy approximation. An empirical study shows that lazy approximation performs much better than discretization, and we successfully applied this new technique to a more realistic planetary rover planning problem.