Author name cluster

Chenjun Xiao

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

26 papers

2 author rows

AAAI Conference 2026 Conference Paper

An MRP Formulation for Supervised Learning: Generalized Temporal Difference Learning Models (Abstract Reprint)

Yangchen Pan
Junfeng Wen
Chenjun Xiao
Philip Torr

Background: Traditional supervised learning (SL) assumes data points are independently and identically distributed (i.i.d.), which overlooks dependencies in real-world data. Reinforcement learning (RL), in contrast, models dependencies through state transitions. Objectives: This study aims to bridge SL and RL by reformulating SL problems as RL tasks, enabling the application of RL techniques to a wider range of SL scenarios. We aim to model SL data as interconnected, and develop novel temporal difference (TD) algorithms that can accommodate diverse data types. Our objectives are to (1) establish conditions where TD outperforms ordinary least squares (OLS), (2) provide convergence guarantees for the generalized TD algorithm, and (3) validate the approach empirically using synthetic and real-world datasets. Methods: We reformulate traditional SL as a RL problem by modeling data points as a Markov Reward Process (MRP). We then introduce a concept analogous to the inverse link function in generalized linear models, allowing our TD algorithm to handle various data types. Our analysis, grounded in variance estimation, identifies conditions where TD outperforms OLS. We establish a convergence guarantee by conceptualizing the TD update rule as a generalized Bellman operator. Empirical validation begins with synthetic data progressively matching theoretical assumptions to verify our analysis, followed by evaluations on real-world datasets to demonstrate practical utility. Results: Our theoretical analysis shows that TD can outperform OLS in estimation accuracy when data noise is correlated. Our approach generalizes across various loss functions and SL datasets. We prove that the Bellman operator in our TD framework is a contraction, ensuring convergence for both expected and stochastic TD updates. Empirically, TD outperforms SL baselines when data aligns with its assumptions, remains competitive across diverse datasets, and is robust to hyperparameter choices. Conclusions: This study demonstrates that SL can be reformulated as a problem of interconnected data modeled by an MRP, effectively solved using TD learning. Our generalized TD is theoretically sound, with convergence guarantees, and practically effective. It generalizes OLS, offering superior performance on correlated data. This work enables RL techniques to benefit SL tasks, offering a pathway for future advancements.

PDF Details DOI

JAIR Journal 2025 Journal Article

An MRP Formulation for Supervised Learning: Generalized Temporal Difference Learning Models

Yangchen Pan
Junfeng Wen
Chenjun Xiao
Philip H. S. Torr

PDF Details DOI

ICML Conference 2025 Conference Paper

Behavior-Regularized Diffusion Policy Optimization for Offline Reinforcement Learning

Chen-Xiao Gao
Chenyang Wu 0001
Mingjun Cao
Chenjun Xiao
Yang Yu 0001
Zongzhang Zhang

Behavior regularization, which constrains the policy to stay close to some behavior policy, is widely used in offline reinforcement learning (RL) to manage the risk of hazardous exploitation of unseen actions. Nevertheless, existing literature on behavior-regularized RL primarily focuses on explicit policy parameterizations, such as Gaussian policies. Consequently, it remains unclear how to extend this framework to more advanced policy parameterizations, such as diffusion models. In this paper, we introduce BDPO, a principled behavior-regularized RL framework tailored for diffusion-based policies, thereby combining the expressive power of diffusion policies and the robustness provided by regularization. The key ingredient of our method is to calculate the Kullback-Leibler (KL) regularization analytically as the accumulated discrepancies in reverse-time transition kernels along the diffusion trajectory. By integrating the regularization, we develop an efficient two-time-scale actor-critic RL algorithm that produces the optimal policy while respecting the behavior constraint. Comprehensive evaluations conducted on synthetic 2D tasks and continuous control tasks from the D4RL benchmark validate its effectiveness and superior performance.

Details

AAMAS Conference 2025 Conference Paper

β-DQN: Improving Deep Q-Learning By Evolving the Behavior

Hongming Zhang
Fengshuo Bai
Chenjun Xiao
Chao Gao
Bo Xu
Martin Müller

While many sophisticated exploration methods have been proposed, their lack of generality and high computational cost often lead researchers to favor simpler methods like 𝜖-greedy. Motivated by this, we introduce 𝛽-DQN, a simple and efficient exploration method that augments the standard DQN with a behavior function 𝛽. This function estimates the probability that each action has been taken at each state. By leveraging 𝛽, we generate a population of diverse policies that balance exploration between state-action coverage and overestimation bias correction. An adaptive meta-controller is designed to select an effective policy for each episode, enabling flexible and explainable exploration. 𝛽-DQN is straightforward to implement and adds minimal computational overhead to the standard DQN. Experiments on both simple and challenging exploration domains show that 𝛽-DQN outperforms existing baseline methods across a wide range of tasks, providing an effective solution for improving exploration in deep reinforcement learning.

PDF

NeurIPS Conference 2024 Conference Paper

Diffusion Spectral Representation for Reinforcement Learning

Dmitry Shribak
Chen-Xiao Gao
Yitong Li
Chenjun Xiao
Bo Dai

Diffusion-based models have achieved notable empirical successes in reinforcement learning (RL) due to their expressiveness in modeling complex distributions. Despite existing methods being promising, the key challenge of extending existing methods for broader real-world applications lies in the computational cost at inference time, i. e. , sampling from a diffusion model is considerably slow as it often requires tens to hundreds of iterations to generate even one sample. To circumvent this issue, we propose to leverage the flexibility of diffusion models for RL from a representation learning perspective. In particular, by exploiting the connection between diffusion models and energy-based models, we develop Diffusion Spectral Representation (Diff-SR), a coherent algorithm framework that enables extracting sufficient representations for value functions in Markov decision processes (MDP) and partially observable Markov decision processes (POMDP). We further demonstrate how Diff-SR facilitates efficient policy optimization and practical algorithms while explicitly bypassing the difficulty and inference cost of sampling from the diffusion model. Finally, we provide comprehensive empirical studies to verify the benefits of Diff-SR in delivering robust and advantageous performance across various benchmarks with both fully and partially observable settings.

PDF Details DOI

NeurIPS Conference 2024 Conference Paper

Exploiting the Replay Memory Before Exploring the Environment: Enhancing Reinforcement Learning Through Empirical MDP Iteration

Hongming Zhang
Chenjun Xiao
Chao Gao
Han Wang
Bo Xu
Martin Müller

Reinforcement learning (RL) algorithms are typically based on optimizing a Markov Decision Process (MDP) using the optimal Bellman equation. Recent studies have revealed that focusing the optimization of Bellman equations solely on in-sample actions tends to result in more stable optimization, especially in the presence of function approximation. Upon on these findings, in this paper, we propose an Empirical MDP Iteration (EMIT) framework. EMIT constructs a sequence of empirical MDPs using data from the growing replay memory. For each of these empirical MDPs, it learns an estimated Q-function denoted as $\widehat{Q}$. The key strength is that by restricting the Bellman update to in-sample bootstrapping, each empirical MDP converges to a unique optimal $\widehat{Q}$ function. Furthermore, gradually expanding from the empirical MDPs to the original MDP induces a monotonic policy improvement. Instead of creating entirely new algorithms, we demonstrate that EMIT can be seamlessly integrated with existing online RL algorithms, effectively acting as a regularizer for contemporary Q-learning methods. We show this by implementing EMIT for two representative RL algorithms, DQN and TD3. Experimental results on Atari and MuJoCo benchmarks show that EMIT significantly reduces estimation errors and substantially improves the performance of both algorithms.

PDF Details DOI

ICML Conference 2024 Conference Paper

HarmonyDream: Task Harmonization Inside World Models

Haoyu Ma
Jialong Wu 0001
Ningya Feng
Chenjun Xiao
Dong Li 0016
Jianye Hao
Jianmin Wang 0001
Mingsheng Long

Model-based reinforcement learning (MBRL) holds the promise of sample-efficient learning by utilizing a world model, which models how the environment works and typically encompasses components for two tasks: observation modeling and reward modeling. In this paper, through a dedicated empirical investigation, we gain a deeper understanding of the role each task plays in world models and uncover the overlooked potential of sample-efficient MBRL by mitigating the domination of either observation or reward modeling. Our key insight is that while prevalent approaches of explicit MBRL attempt to restore abundant details of the environment via observation models, it is difficult due to the environment’s complexity and limited model capacity. On the other hand, reward models, while dominating implicit MBRL and adept at learning compact task-centric dynamics, are inadequate for sample-efficient learning without richer learning signals. Motivated by these insights and discoveries, we propose a simple yet effective approach, HarmonyDream, which automatically adjusts loss coefficients to maintain task harmonization, i. e. a dynamic equilibrium between the two tasks in world model learning. Our experiments show that the base MBRL method equipped with HarmonyDream gains 10%-69% absolute performance boosts on visual robotic tasks and sets a new state-of-the-art result on the Atari 100K benchmark. Code is available at https: //github. com/thuml/HarmonyDream.

Details

NeurIPS Conference 2024 Conference Paper

Iteratively Refined Behavior Regularization for Offline Reinforcement Learning

Yi Ma
Jianye Hao
Xiaohan Hu
Yan Zheng
Chenjun Xiao

One of the fundamental challenges for offline reinforcement learning (RL) is ensuring robustness to data distribution. Whether the data originates from a near-optimal policy or not, we anticipate that an algorithm should demonstrate its ability to learn an effective control policy that seamlessly aligns with the inherent distribution of offline data. Unfortunately, behavior regularization, a simple yet effective offline RL algorithm, tends to struggle in this regard. In this paper, we propose a new algorithm that substantially enhances behavior-regularization based on conservative policy iteration. Our key observation is that by iteratively refining the reference policy used for behavior regularization, conservative policy update guarantees gradually improvement, while also implicitly avoiding querying out-of-sample actions to prevent catastrophic learning failures. We prove that in the tabular setting this algorithm is capable of learning the optimal policy covered by the offline dataset, commonly referred to as the in-sample optimal policy. We then explore several implementation details of the algorithm when function approximations are applied. The resulting algorithm is easy to implement, requiring only a few lines of code modification to existing methods. Experimental results on the D4RL benchmark indicate that our method outperforms previous state-of-the-art baselines in most tasks, clearly demonstrate its superiority over behavior regularization.

PDF Details DOI

AAAI Conference 2024 Conference Paper

Multiagent Gumbel MuZero: Efficient Planning in Combinatorial Action Spaces

Xiaotian Hao
Jianye Hao
Chenjun Xiao
Kai Li
Dong Li
Yan Zheng

AlphaZero and MuZero have achieved state-of-the-art (SOTA) performance in a wide range of domains, including board games and robotics, with discrete and continuous action spaces. However, to obtain an improved policy, they often require an excessively large number of simulations, especially for domains with large action spaces. As the simulation budget decreases, their performance drops significantly. In addition, many important real-world applications have combinatorial (or exponential) action spaces, making it infeasible to search directly over all possible actions. In this paper, we extend AlphaZero and MuZero to learn and plan in more complex multiagent (MA) Markov decision processes, where the action spaces increase exponentially with the number of agents. Our new algorithms, MA Gumbel AlphaZero and MA Gumbel MuZero, respectively without and with model learning, achieve superior performance on cooperative multiagent control problems, while reducing the number of environmental interactions by up to an order of magnitude compared to model-free approaches. In particular, we significantly improve prior performance when planning with much fewer simulation budgets. The code and appendix are available at https://github.com/tjuHaoXiaotian/MA-MuZero.

PDF Details DOI

ICML Conference 2024 Conference Paper

Provable Representation with Efficient Planning for Partially Observable Reinforcement Learning

Hongming Zhang 0012
Tongzheng Ren
Chenjun Xiao
Dale Schuurmans
Bo Dai 0001

In most real-world reinforcement learning applications, state information is only partially observable, which breaks the Markov decision process assumption and leads to inferior performance for algorithms that conflate observations with state. Partially Observable Markov Decision Processes (POMDPs), on the other hand, provide a general framework that allows for partial observability to be accounted for in learning, exploration and planning, but presents significant computational and statistical challenges. To address these difficulties, we develop a representation-based perspective that leads to a coherent framework and tractable algorithmic approach for practical reinforcement learning from partial observations. We provide a theoretical analysis for justifying the statistical efficiency of the proposed algorithm, and also empirically demonstrate the proposed algorithm can surpass state-of-the-art performance with partial observations across various benchmarks, advancing reliable reinforcement learning towards more practical applications.

Details

ICML Conference 2024 Conference Paper

Rethinking Decision Transformer via Hierarchical Reinforcement Learning

Yi Ma 0005
Jianye Hao
Hebin Liang
Chenjun Xiao

Decision Transformer (DT) is an innovative algorithm leveraging recent advances of the transformer architecture in reinforcement learning (RL). However, a notable limitation of DT is its reliance on recalling trajectories from datasets, losing the capability to seamlessly stitch sub-optimal trajectories together. In this work we introduce a general sequence modeling framework for studying sequential decision making through the lens of Hierarchical RL. At the time of making decisions, a high-level policy first proposes an ideal prompt for the current state, a low-level policy subsequently generates an action conditioned on the given prompt. We show DT emerges as a special case of this framework with certain choices of high-level and low-level policies, and discuss the potential failure of these choices. Inspired by these observations, we study how to jointly optimize the high-level and low-level policies to enable the stitching ability, which further leads to the development of new offline RL algorithms. Our empirical results clearly show that the proposed algorithms significantly surpass DT on several control and navigation benchmarks. We hope our contributions can inspire the integration of transformer architectures within the field of RL.

Details

ICML Conference 2024 Conference Paper

Target Networks and Over-parameterization Stabilize Off-policy Bootstrapping with Function Approximation

Fengdi Che
Chenjun Xiao
Jincheng Mei
Bo Dai 0001
Ramki Gummadi
Oscar Ramirez
Christopher K. Harris
A. Rupam Mahmood

We prove that the combination of a target network and over-parameterized linear function approximation establishes a weaker convergence condition for bootstrapped value estimation in certain cases, even with off-policy data. Our condition is naturally satisfied for expected updates over the entire state-action space or learning with a batch of complete trajectories from episodic Markov decision processes. Notably, using only a target network or an over-parameterized model does not provide such a convergence guarantee. Additionally, we extend our results to learning with truncated trajectories, showing that convergence is achievable for all tasks with minor modifications, akin to value truncation for the final states in trajectories. Our primary result focuses on temporal difference estimation for prediction, providing high-probability value estimation error bounds and empirical analysis on Baird’s counterexample and a Four-room task. Furthermore, we explore the control setting, demonstrating that similar convergence conditions apply to Q-learning.

Details

UAI Conference 2023 Conference Paper

Conditionally optimistic exploration for cooperative deep multi-agent reinforcement learning

Xutong Zhao
Yangchen Pan
Chenjun Xiao
Sarath Chandar
Janarthanan Rajendran

Efficient exploration is critical in cooperative deep Multi-Agent Reinforcement Learning (MARL). In this work, we propose an exploration method that effectively encourages cooperative exploration based on the idea of sequential action-computation scheme. The high-level intuition is that to perform optimism-based exploration, agents would explore cooperative strategies if each agent’s optimism estimate captures a structured dependency relationship with other agents. Assuming agents compute actions following a sequential order at each environment timestep, we provide a perspective to view MARL as tree search iterations by considering agents as nodes at different depths of the search tree. Inspired by the theoretically justified tree search algorithm UCT (Upper Confidence bounds applied to Trees), we develop a method called Conditionally Optimistic Exploration (COE). COE augments each agent’s state-action value estimate with an action-conditioned optimistic bonus derived from the visitation count of the global state and joint actions of preceding agents. COE is performed during training and disabled at deployment, making it compatible with any value decomposition method for centralized training with decentralized execution. Experiments across various cooperative MARL benchmarks show that COE outperforms current state-of-the-art exploration methods on hard-exploration tasks.

Details

UAI Conference 2023 Conference Paper

Energy-based Predictive Representations for Partially Observed Reinforcement Learning

Tianjun Zhang
Tongzheng Ren
Chenjun Xiao
Wenli Xiao
Joseph E. Gonzalez
Dale Schuurmans
Bo Dai 0001

In real-world applications, handling partial observability is a common requirement for reinforcement learning algorithms, which is not captured by a Markov decision process (MDP). Although partially observable Markov decision processes (POMDPs) have been specifically designed to address this requirement, they present significant computational and statistical challenges in learning and planning. In this work, we introduce the Energy-based Predictive Representation (EPR) to provide a unified approach for designing practical reinforcement learning algorithms in both the MDP and POMDP settings. This framework enables coherent handling of learning, exploration, and planning tasks. The proposed framework leverages a powerful neural energy-based model to extract an adequate representation, allowing for efficient approximation of Q-functions. This representation facilitates the efficient computation of confidence, enabling the implementation of optimism or pessimism in planning when faced with uncertainty. Consequently, it effectively manages the trade-off between exploration and exploitation. Experimental investigations demonstrate that the proposed algorithm achieves state-of-the-art performance in both MDP and POMDP settings.

Details

ICLR Conference 2023 Conference Paper

Latent Variable Representation for Reinforcement Learning

Tongzheng Ren
Chenjun Xiao
Tianjun Zhang
Na Li 0002
Zhaoran Wang 0001
Sujay Sanghavi
Dale Schuurmans
Bo Dai 0001

Deep latent variable models have achieved significant empirical successes in model-based reinforcement learning (RL) due to their expressiveness in modeling complex transition dynamics. On the other hand, it remains unclear theoretically and empirically how latent variable models may facilitate learning, planning, and exploration to improve the sample efficiency of RL. In this paper, we provide a representation view of the latent variable models for state-action value functions, which allows both tractable variational learning algorithm and effective implementation of the optimism/pessimism principle in the face of uncertainty for exploration. In particular, we propose a computationally efficient planning algorithm with UCB exploration by incorporating kernel embeddings of latent variable models. Theoretically, we establish the sample complexity of the proposed approach in the online and offline settings. Empirically, we demonstrate superior performance over current state-of-the-art algorithms across various benchmarks.

Details

ICLR Conference 2023 Conference Paper

Replay Memory as An Empirical MDP: Combining Conservative Estimation with Experience Replay

Hongming Zhang 0003
Chenjun Xiao
Han Wang
Jun Jin 0001
Bo Xu 0002
Martin Müller 0003

Experience replay, which stores transitions in a replay memory for repeated use, plays an important role of improving sample efficiency in reinforcement learning. Existing techniques such as reweighted sampling, episodic learning and reverse sweep update further process the information in the replay memory to make experience replay more efficient. In this work, we further exploit the information in the replay memory by treating it as an empirical \emph{Replay Memory MDP (RM-MDP)}. By solving it with dynamic programming, we learn a conservative value estimate that \emph{only} considers transitions observed in the replay memory. Both value and policy regularizers based on this conservative estimate are developed and integrated with model-free learning algorithms. We design the metric \textit{memory density} to measure the quality of RM-MDP. Our empirical studies quantitatively find a strong correlation between performance improvement and memory density. Our method combines \emph{Conservative Estimation with Experience Replay (CEER)}, improving sample efficiency by a large margin, especially when the memory density is high. Even when the memory density is low, such a conservative estimate can still help to avoid suicidal actions and thereby improve performance.

Details

ICLR Conference 2023 Conference Paper

The In-Sample Softmax for Offline Reinforcement Learning

Chenjun Xiao
Han Wang 0066
Yangchen Pan
Adam White 0001
Martha White

Reinforcement learning (RL) agents can leverage batches of previously collected data to extract a reasonable control policy. An emerging issue in this offline RL setting, however, is that the bootstrapping update underlying many of our methods suffers from insufficient action-coverage: standard max operator may select a maximal action that has not been seen in the dataset. Bootstrapping from these inaccurate values can lead to overestimation and even divergence. There are a growing number of methods that attempt to approximate an in-sample max, that only uses actions well-covered by the dataset. We highlight a simple fact: it is more straightforward to approximate an in-sample softmax using only actions in the dataset. We show that policy iteration based on the in-sample softmax converges, and that for decreasing temperatures it approaches the in-sample max. We derive an In-Sample Actor-Critic (AC), using this in-sample softmax, and show that it is consistently better or comparable to existing offline RL methods, and is also well-suited to fine-tuning. We release the code at github.com/hwang-ua/inac_pytorch.

Details

ICLR Conference 2022 Conference Paper

Understanding and Leveraging Overparameterization in Recursive Value Estimation

Chenjun Xiao
Bo Dai 0001
Jincheng Mei
Oscar Ramirez
Ramki Gummadi
Christopher K. Harris
Dale Schuurmans

The theory of function approximation in reinforcement learning (RL) typically considers low capacity representations that incur a tradeoff between approximation error, stability and generalization. Current deep architectures, however, operate in an overparameterized regime where approximation error is not necessarily a bottleneck. To better understand the utility of deep models in RL we present an analysis of recursive value estimation using \emph{overparameterized} linear representations that provides useful, transferable findings. First, we show that classical updates such as temporal difference (TD) learning or fitted-value-iteration (FVI) converge to \emph{different} fixed points than residual minimization (RM) in the overparameterized linear case. We then develop a unified interpretation of overparameterized linear value estimation as minimizing the Euclidean norm of the weights subject to alternative constraints. A practical consequence is that RM can be modified by a simple alteration of the backup targets to obtain the same fixed points as FVI and TD (when they converge), while universally ensuring stability. Further, we provide an analysis of the generalization error of these methods, demonstrating per iterate bounds on the value prediction error of FVI, and fixed point bounds for TD and RM. Given this understanding, we then develop new algorithmic tools for improving recursive value estimation with deep models. In particular, we extract two regularizers that penalize out-of-span top-layer weights and co-linearity in top-layer features respectively. Empirically we find that these regularizers dramatically improve the stability of TD and FVI, while allowing RM to match and even sometimes surpass their generalization performance with assured stability.

Details

ICML Conference 2021 Conference Paper

On the Optimality of Batch Policy Optimization Algorithms

Chenjun Xiao
Yifan Wu
Jincheng Mei
Bo Dai 0001
Tor Lattimore
Lihong Li 0001
Csaba Szepesvári
Dale Schuurmans

Batch policy optimization considers leveraging existing data for policy construction before interacting with an environment. Although interest in this problem has grown significantly in recent years, its theoretical foundations remain under-developed. To advance the understanding of this problem, we provide three results that characterize the limits and possibilities of batch policy optimization in the finite-armed stochastic bandit setting. First, we introduce a class of confidence-adjusted index algorithms that unifies optimistic and pessimistic principles in a common framework, which enables a general analysis. For this family, we show that any confidence-adjusted index algorithm is minimax optimal, whether it be optimistic, pessimistic or neutral. Our analysis reveals that instance-dependent optimality, commonly used to establish optimality of on-line stochastic bandit algorithms, cannot be achieved by any algorithm in the batch setting. In particular, for any algorithm that performs optimally in some environment, there exists another environment where the same algorithm suffers arbitrarily larger regret. Therefore, to establish a framework for distinguishing algorithms, we introduce a new weighted-minimax criterion that considers the inherent difficulty of optimal value prediction. We demonstrate how this criterion can be used to justify commonly used pessimistic principles for batch policy optimization.

Details

NeurIPS Conference 2021 Conference Paper

Understanding the Effect of Stochasticity in Policy Optimization

Jincheng Mei
Bo Dai
Chenjun Xiao
Csaba Szepesvari
Dale Schuurmans

We study the effect of stochasticity in on-policy policy optimization, and make the following four contributions. \emph{First}, we show that the preferability of optimization methods depends critically on whether stochastic versus exact gradients are used. In particular, unlike the true gradient setting, geometric information cannot be easily exploited in the stochastic case for accelerating policy optimization without detrimental consequences or impractical assumptions. \emph{Second}, to explain these findings we introduce the concept of committal rate for stochastic policy optimization, and show that this can serve as a criterion for determining almost sure convergence to global optimality. \emph{Third}, we show that in the absence of external oracle information, which allows an algorithm to determine the difference between optimal and sub-optimal actions given only on-policy samples, there is an inherent trade-off between exploiting geometry to accelerate convergence versus achieving optimality almost surely. That is, an uninformed algorithm either converges to a globally optimal policy with probability $1$ but at a rate no better than $O(1/t)$, or it achieves faster than $O(1/t)$ convergence but then must fail to converge to the globally optimal policy with some positive probability. \emph{Finally}, we use the committal rate theory to explain why practical policy optimization methods are sensitive to random initialization, then develop an ensemble method that can be guaranteed to achieve near-optimal solutions with high probability.

PDF Details

NeurIPS Conference 2020 Conference Paper

Escaping the Gravitational Pull of Softmax

Jincheng Mei
Chenjun Xiao
Bo Dai
Lihong Li
Csaba Szepesvari
Dale Schuurmans

The softmax is the standard transformation used in machine learning to map real-valued vectors to categorical distributions. Unfortunately, this transform poses serious drawbacks for gradient descent (ascent) optimization. We reveal this difficulty by establishing two negative results: (1) optimizing any expectation with respect to the softmax must exhibit sensitivity to parameter initialization ( softmax gravity well''), and (2) optimizing log-probabilities under the softmax must exhibit slow convergence ( softmax damping''). Both findings are based on an analysis of convergence rates using the Non-uniform \L{}ojasiewicz (N\L{}) inequalities. To circumvent these shortcomings we investigate an alternative transformation, the \emph{escort} mapping, that demonstrates better optimization properties. The disadvantages of the softmax and the effectiveness of the escort transformation are further explained using the concept of N\L{} coefficient. In addition to proving bounds on convergence rates to firmly establish these results, we also provide experimental evidence for the superiority of the escort transformation.

PDF Details

ICML Conference 2020 Conference Paper

On the Global Convergence Rates of Softmax Policy Gradient Methods

Jincheng Mei
Chenjun Xiao
Csaba Szepesvári
Dale Schuurmans

We make three contributions toward better understanding policy gradient methods in the tabular setting. First, we show that with the true gradient, policy gradient with a softmax parametrization converges at a $O(1/t)$ rate, with constants depending on the problem and initialization. This result significantly expands the recent asymptotic convergence results. The analysis relies on two findings: that the softmax policy gradient satisfies a Ł{}ojasiewicz inequality, and the minimum probability of an optimal action during optimization can be bounded in terms of its initial value. Second, we analyze entropy regularized policy gradient and show that it enjoys a significantly faster linear convergence rate $O(e^{-t})$ toward softmax optimal policy. This result resolves an open question in the recent literature. Finally, combining the above two results and additional new $\Omega(1/t)$ lower bound results, we explain how entropy regularization improves policy optimization, even with the true gradient, from the perspective of convergence rate. The separation of rates is further explained using the notion of non-uniform Ł{}ojasiewicz degree. These results provide a theoretical understanding of the impact of entropy and corroborate existing empirical studies.

Details

NeurIPS Conference 2019 Conference Paper

Maximum Entropy Monte-Carlo Planning

Chenjun Xiao
Ruitong Huang
Jincheng Mei
Dale Schuurmans
Martin Müller

We develop a new algorithm for online planning in large scale sequential decision problems that improves upon the worst case efficiency of UCT. The idea is to augment Monte-Carlo Tree Search (MCTS) with maximum entropy policy optimization, evaluating each search node by softmax values back-propagated from simulation. To establish the effectiveness of this approach, we first investigate the single-step decision problem, stochastic softmax bandits, and show that softmax values can be estimated at an optimal convergence rate in terms of mean squared error. We then extend this approach to general sequential decision making by developing a general MCTS algorithm, Maximum Entropy for Tree Search (MENTS). We prove that the probability of MENTS failing to identify the best decision at the root decays exponentially, which fundamentally improves the polynomial convergence rate of UCT. Our experimental results also demonstrate that MENTS is more sample efficient than UCT in both synthetic problems and Atari 2600 games.

PDF Details

IJCAI Conference 2019 Conference Paper

On Principled Entropy Exploration in Policy Optimization

Jincheng Mei
Chenjun Xiao
Ruitong Huang
Dale Schuurmans
Martin Müller

In this paper, we investigate Exploratory Conservative Policy Optimization (ECPO), a policy optimization strategy that improves exploration behavior while assuring monotonic progress in a principled objective. ECPO conducts maximum entropy exploration within a mirror descent framework, but updates policies using reversed KL projection. This formulation bypasses undesirable mode seeking behavior and avoids premature convergence to sub-optimal policies, while still supporting strong theoretical properties such as guaranteed policy improvement. Experimental evaluations demonstrate that the proposed method significantly improves practical exploration and surpasses the empirical performance of state-of-the art policy optimization methods in a set of benchmark tasks.

PDF Details

AAAI Conference 2018 Conference Paper

Memory-Augmented Monte Carlo Tree Search

Chenjun Xiao
Jincheng Mei
Martin Müller

This paper proposes and evaluates Memory-Augmented Monte Carlo Tree Search (M-MCTS), which provides a new approach to exploit generalization in online real-time search. The key idea of M-MCTS is to incorporate MCTS with a memory structure, where each entry contains information of a particular state. This memory is used to generate an approximate value estimation by combining the estimations of similar states. We show that the memory based value approximation is better than the vanilla Monte Carlo estimation with high probability under mild conditions. We evaluate M- MCTS in the game of Go. Experimental results show that M- MCTS outperforms the original MCTS with the same number of simulations.

PDF Details

AAAI Conference 2016 Conference Paper

Factorization Ranking Model for Move Prediction in the Game of Go

Chenjun Xiao
Martin Müller

In this paper, we investigate the move prediction problem in the game of Go by proposing a new ranking model named Factorization Bradley Terry (FBT) model. This new model considers the move prediction problem as group competitions while also taking the interaction between features into account. A FBT model is able to provide a probability distribution that expresses a preference over moves. Therefore it can be easily compiled into an evaluation function and applied in a modern Go program. We propose a Stochastic Gradient Decent (SGD) algorithm to train a FBT model using expert game records, and provide two methods for fast computation of the gradient in order to speed up the training process. Experimental results show that our FBT model outperforms the state-of-the-art move prediction system of Latent Factor Ranking (LFR).

PDF Details