Arrow Research search

Author name cluster

Li Zhao

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

38 papers
1 author row

Possible papers (38)

AAAI Conference 2026 Conference Paper

IGIANet: Illumination Guided Implicit Alignment Network for Infrared–Visible UAV Detection

  • Xiangqi Chen
  • Dawei Zhang
  • Li Zhao
  • Chengzhuan Yang
  • Zhongyu Chen
  • Jungang Lou
  • Zhonglong Zheng
  • Sang-Woon Jeon

Visible-Infrared (RGB-IR) Unmanned Aerial Vehicle (UAV) object detection integrates complementary cues from visible and infrared sensors, offering broad application potential. However, due to sensor parallax, it still faces the challenge of weak spatial misalignment, which significantly limits its performance in UAV-based object detection. Existing methods emphasize strict alignment, overlooking spectral heterogeneity under varying illumination. To address these issues, we propose the Illumination Guided Implicit Alignment Network (IGIANet) to mitigate modality heterogeneity without explicit alignment. Specifically, we integrate three novel modules. First, we propose an illumination-guided frequency modulation module that adaptively allocates fusion weights to visible and infrared features based on global illumination estimation, effectively alleviating modality imbalance under varying lighting conditions. Second, we introduce a frequency-guided cross-modality differential enhancement module, which computes differential cues across frequency domains to enhance complementary information and highlight weakly aligned and low-contrast regions. Finally, we introduce an implicit alignment-driven dynamic fusion module that actively estimates offsets and generates dynamic, position-adaptive fusion kernels to align and fuse modalities. Extensive experiments demonstrate that IGIANet outperforms state-of-the-art models on various benchmarks, achieving 80.9% mAP on DroneVehicle, 57.1% mAP on VEDAI, and 49.4% mAP on FLIR.
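
As a rough illustration of the illumination-guided weighting idea described above (a sketch of my own reading of the abstract, not the authors' module), a global illumination estimate from the visible image can gate how much each modality contributes to the fused feature:

```python
# Minimal sketch: a tiny CNN regresses a scalar illumination score from the
# RGB input, which then weights the RGB vs. IR feature maps. All layer sizes
# and the gating form are illustrative assumptions.
import torch
import torch.nn as nn

class IlluminationGate(nn.Module):
    def __init__(self):
        super().__init__()
        self.estimator = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, 1), nn.Sigmoid(),
        )

    def forward(self, rgb_img, feat_rgb, feat_ir):
        w = self.estimator(rgb_img).view(-1, 1, 1, 1)  # bright scene -> trust RGB
        return w * feat_rgb + (1.0 - w) * feat_ir

gate = IlluminationGate()
rgb = torch.rand(2, 3, 128, 128)
f_rgb, f_ir = torch.rand(2, 64, 32, 32), torch.rand(2, 64, 32, 32)
fused = gate(rgb, f_rgb, f_ir)   # (2, 64, 32, 32)
```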

AAAI Conference 2026 Conference Paper

MaskAnyNet: Rethinking Masked Image Regions as Valuable Information in Supervised Learning

  • Jingshan Hong
  • Haigen Hu
  • Huihuang Zhang
  • Qianwei Zhou
  • Li Zhao

In supervised learning, traditional image masking faces two key issues: (i) discarded pixels are underutilized, leading to a loss of valuable contextual information; (ii) masking may remove small or critical features, especially in fine-grained tasks. In contrast, masked image modeling (MIM) has demonstrated that masked regions can be reconstructed from partial input, revealing that even incomplete data can exhibit strong contextual consistency with the original image. This highlights the potential of masked regions as sources of semantic diversity. Motivated by this, we revisit the image masking approach, proposing to treat masked content as auxiliary knowledge rather than discarding it. Based on this, we propose MaskAnyNet, which combines masking with a relearning mechanism to exploit both visible and masked information. It can be easily extended to any model with an additional branch that jointly learns from the recomposed masked region. This approach leverages the semantic diversity of masked regions to enrich features and preserve fine-grained details. Experiments on CNN and Transformer backbones show consistent gains across multiple benchmarks. Further analysis confirms that the proposed method improves semantic diversity through the reuse of masked content.
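
A minimal sketch of the masking-plus-relearning idea, under my own assumptions about the architecture (patch size, branch weighting, and the two heads are all illustrative, not the paper's design):

```python
# One backbone sees the visibly-masked image; a second head relearns from the
# recomposed masked patches; both contribute to the supervised loss.
import torch
import torch.nn as nn

def random_patch_mask(x, patch=16, ratio=0.5):
    b, c, h, w = x.shape
    gh, gw = h // patch, w // patch
    keep = torch.rand(b, 1, gh, gw, device=x.device) > ratio
    mask = keep.float().repeat_interleave(patch, -2).repeat_interleave(patch, -1)
    return x * mask, x * (1 - mask)  # visible part, masked-out part

backbone = nn.Sequential(nn.Conv2d(3, 32, 3, 2, 1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten())
head_main, head_mask = nn.Linear(32, 10), nn.Linear(32, 10)

x, y = torch.rand(4, 3, 224, 224), torch.randint(0, 10, (4,))
visible, masked = random_patch_mask(x)
loss = (nn.functional.cross_entropy(head_main(backbone(visible)), y)
        + 0.5 * nn.functional.cross_entropy(head_mask(backbone(masked)), y))
loss.backward()
```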

YNIMG Journal 2025 Journal Article

Age and gender-related patterns of arterial transit time and cerebral blood flow in healthy adults

  • Zongpai Zhang
  • Elizabeth Riley
  • Shichun Chen
  • Li Zhao
  • Adam K. Anderson
  • Eve DeRosa
  • Weiying Dai

Normal aging has been associated with increased arterial transit time (ATT) and reduced cerebral blood flow (CBF). However, age-related patterns of ATT and CBF and their relationship remain unclear. This is partly due to the lengthy scan times required for ATT measurements, which caused previous age-related CBF studies to not fully account for transit time. In this work, we aimed to elucidate age-related ATT and ATT-corrected CBF patterns. We examined 131 healthy subjects aged 19 to 82 years old using two pseudo-continuous arterial spin labeling (PCASL) MRI scans: one to measure fast low-resolution ATT maps with five post-labeling delays and the other to measure high-resolution perfusion-weighted maps with a single post-labeling delay. Vessel suppression was applied to both the ATT and perfusion-weighted acquisitions. We found that ATT increases with age in the frontal, temporoparietal, and occipital regions, with a more pronounced elongation in males compared to females in the middle temporal gyrus. ATT-corrected CBF decreases with age in several brain regions, including the anterior cingulate, insula, posterior cingulate, angular, precuneus, supramarginal, frontal, parietal, superior and middle temporal, occipital, and cerebellar regions, while remaining stable in the inferior temporal and subcortical regions. In contrast, without ATT correction, we detected artifactual decreases in the inferior temporal and precentral regions. These findings suggest that ATT provides valuable and independent insights into microvascular deficits and should be incorporated into CBF measurements for studies involving aging populations.

NeurIPS Conference 2025 Conference Paper

Dyn-O: Building Structured World Models with Object-Centric Representations

  • Zizhao Wang
  • Kaixin Wang
  • Li Zhao
  • Peter Stone
  • Jiang Bian

World models aim to capture the dynamics of the environment, enabling agents to predict and plan for future states. In most scenarios of interest, the dynamics are highly centered on interactions among objects within the environment. This motivates the development of world models that operate on object-centric rather than monolithic representations, with the goal of more effectively capturing environment dynamics and enhancing compositional generalization. However, the development of object-centric world models has largely been explored in environments with limited visual complexity (such as basic geometries). It remains underexplored whether such models can be effective in more challenging settings. In this paper, we fill this gap by introducing Dyn-O, an enhanced structured world model built upon object-centric representations. Compared to prior work on object-centric representations, Dyn-O improves in both learning representations and modeling dynamics. On the challenging Procgen games, we demonstrate that our method can learn object-centric world models directly from pixel observations, outperforming DreamerV3 in rollout prediction accuracy. Furthermore, by decoupling object-centric features into dynamics-agnostic and dynamics-aware components, we enable finer-grained manipulation of these features and generate more diverse imagined trajectories. The code of Dyn-O can be found at https://github.com/wangzizhao/dyn-O.

YNIMG Journal 2025 Journal Article

Morphological changes of the choroid plexus in the lateral ventricle across the lifespan: 5551 subjects from fetus to elderly

  • Jiaxin Li
  • Yuxuan Gao
  • Yunzhi Xu
  • Weiying Dai
  • Yueqin Hu
  • Xue Feng
  • Dan Wu
  • Li Zhao

BACKGROUND: The developmental trajectory and aging process of the choroid plexus (ChP) in humans remain largely unexplored, and normative growth standards for the ChP across the lifespan are lacking. METHODS: High-resolution magnetic resonance images were collected from cohorts of 5551 subjects, ranging in age from 21 gestational weeks to 90 years. ChP segmentation was performed using a combination of an automated pipeline and manual annotations. The ChP volume, the ratio of ChP to brain parenchyma, and the ratio of ChP to brain ventricle were modeled using linear and quadratic regression. Additional morphological features of the ChP were investigated. RESULTS: The absolute and relative volumes of the ChP throughout the lifespan were provided, including growth charts and a normative reference table. In addition, the morphological features, including maximum 3D diameter, flatness, and elongation of the ChP, reveal the turning points of fetal brain development in the third trimester. Furthermore, the ratio of ChP to lateral ventricle, the ratio of ChP to brain parenchyma, and the flatness and elongation of the ChP reveal characteristics of brain aging beginning at 30 years old. The enhanced ChP segmentation pipeline is available on GitHub: https://github.com/princeleeee/ChP-Seg. CONCLUSIONS: This study provides a baseline measurement of the ChP across the lifespan, which reveals ChP characteristics in brain development and aging.
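
The curve-fitting step the abstract describes (linear and quadratic regression of volume on age) is straightforward; a hedged numpy sketch on synthetic data, not the study's, looks like this:

```python
# Fit linear and quadratic growth models of ChP volume vs. age; the fake
# coefficients and noise level below are placeholders, not study values.
import numpy as np

rng = np.random.default_rng(0)
age = rng.uniform(0, 90, 300)                                        # years
vol = 1.2 + 0.02 * age - 0.0001 * age**2 + rng.normal(0, 0.1, 300)   # fake mL

lin = np.polyfit(age, vol, deg=1)    # volume ~ a*age + b
quad = np.polyfit(age, vol, deg=2)   # volume ~ a*age^2 + b*age + c
print("linear coeffs:", lin, "quadratic coeffs:", quad)
# Percentile growth charts would then be read off the fitted curve plus
# residual quantiles at each age.
```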

AAAI Conference 2025 Conference Paper

One-Shot Reference-based Structure-Aware Image to Sketch Synthesis

  • Rui Yang
  • Honghong Yang
  • Li Zhao
  • Qin Lei
  • Mianxiong Dong
  • Kaoru Ota
  • Xiaojun Wu

Generating sketches that accurately reflect the content of reference images presents numerous challenges. Current methods either require paired training data or fail to accommodate a wider range and diversity of sketch styles. While pre-trained diffusion models have shown strong text-based control capabilities, state-of-the-art methods still struggle with reference-based sketch generation for a given content image. The main difficulties lie in (1) balancing content preservation with style enhancement, and (2) representing content image textures at varying levels of abstraction to approximate the reference sketch style. In this paper, we propose a method (Ref2Sketch-SA) that transforms a given content image into a sketch based on a reference sketch. The core strategies include (1) using DDIM Inversion to enhance structural consistency in the sketch generation of content images; and (2) injecting noise into the input image during the denoising process to produce a sketch that retains content attributes while aligning with, yet differing in texture from, the reference. Our model demonstrates superior performance across multiple evaluation metrics, including user style preference.

NeurIPS Conference 2025 Conference Paper

What Do Latent Action Models Actually Learn?

  • Chuheng Zhang
  • Tim Pearce
  • Pushi Zhang
  • Kaixin Wang
  • Xiaoyu Chen
  • Wei Shen
  • Li Zhao
  • Jiang Bian

Latent action models (LAMs) aim to learn action-relevant changes from unlabeled videos by compressing changes between frames as latents. However, differences between video frames can be caused by controllable changes as well as exogenous noise, leading to an important concern: do the latents capture the changes caused by actions, or irrelevant noise? This paper studies this issue analytically, presenting a linear model that encapsulates the essence of LAM learning while remaining tractable. This provides several insights, including connections between LAMs and principal component analysis (PCA), desiderata for the data-generating policy, and justification of strategies to encourage learning controllable changes using data augmentation, data cleaning, and auxiliary action prediction. We also provide illustrative results based on numerical simulation, shedding light on the specific structure of observations, actions, and noise in data that influences LAM learning.
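
The claimed connection to PCA admits a compact numerical illustration. The construction below is mine, not the paper's: latent action directions are recovered as principal components of frame differences under isotropic exogenous noise.

```python
# Synthetic linear LAM: frame differences = action subspace + noise.
# PCA on the differences should recover the action subspace when noise is small.
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 32, 5000, 2
A = rng.normal(size=(d, k))            # ground-truth action directions
actions = rng.normal(size=(k, n))      # controllable changes
noise = 0.1 * rng.normal(size=(d, n))  # exogenous noise
deltas = A @ actions + noise           # frame differences o_{t+1} - o_t

u, s, _ = np.linalg.svd(deltas - deltas.mean(1, keepdims=True),
                        full_matrices=False)
recovered = u[:, :k]                   # top-k principal components

# Principal-angle check: singular values near 1 mean the subspaces align.
overlap = np.linalg.svd(recovered.T @ np.linalg.qr(A)[0], compute_uv=False)
print("subspace alignment:", overlap)  # ~[1., 1.] when noise is small
```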

YNIMG Journal 2024 Journal Article

An improved spectral clustering method for accurate detection of brain resting-state networks

  • Jason Barrett
  • Haomiao Meng
  • Zongpai Zhang
  • Song M. Chen
  • Li Zhao
  • David C. Alsop
  • Xingye Qiao
  • Weiying Dai

This paper proposes a data-driven analysis method to accurately partition large-scale resting-state functional brain networks from fMRI data. The method is based on a spectral clustering algorithm and combines eigenvector direction selection with Pearson correlation clustering in the spectral space. The method is an improvement on available spectral clustering methods, capable of robustly identifying active brain networks consistent with those from model-driven methods at different noise levels, even at the noise level of real fMRI data.
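
A hedged sketch of the two ingredients the abstract names (spectral embedding followed by Pearson-correlation clustering in the spectral space) on synthetic data; the eigenvector direction selection step is the paper's contribution and is not reproduced here.

```python
# Spectral embedding of a kNN affinity graph, then correlation-based
# assignment in the spectral space. Seed rows are a toy stand-in for the
# paper's clustering step.
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import laplacian

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.3, size=(50, 5)) for m in (0, 2, 4)])

W = kneighbors_graph(X, n_neighbors=10, mode="connectivity")
W = 0.5 * (W + W.T)                      # symmetric affinity
L = laplacian(W, normed=True).toarray()
vals, vecs = np.linalg.eigh(L)
emb = vecs[:, :3]                        # spectral embedding for 3 clusters
emb /= np.linalg.norm(emb, axis=1, keepdims=True) + 1e-12

seeds = emb[[0, 50, 100]]                # one seed row per cluster (toy)
corr = np.corrcoef(np.vstack([emb, seeds]))[:len(emb), len(emb):]
labels = corr.argmax(axis=1)             # Pearson-correlation assignment
print(np.bincount(labels))               # roughly 50/50/50
```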

IJCAI Conference 2024 Conference Paper

Diversification of Adaptive Policy for Effective Offline Reinforcement Learning

  • Yunseon Choi
  • Li Zhao
  • Chuheng Zhang
  • Lei Song
  • Jiang Bian
  • Kee-Eung Kim

Offline Reinforcement Learning (RL) aims to learn policies from pre-collected datasets that capture only a subset of the environment's dynamics. The predominant approach has been to solve a constrained optimization formulation, which ensures that the policy visits state-action pairs within the support of the offline dataset. However, this approach limits the ability to make decisions when the agent faces unknown parts of the environment at deployment time. To address the challenge of decision-making in out-of-support regions, model-based Bayes-adaptive approaches have been proposed that consider all dynamics models that could potentially be the true environment. Since it is generally infeasible to compute the posterior over all dynamics models from the offline dataset, these approaches usually approximate the posterior with a finite ensemble of highly probable dynamics models. Hence, the diversity of these models is the key to obtaining good policies. In this work, we propose MoDAP (Model-based Diverse Adaptive Policy Learning), an algorithm that enables the adaptive policy to make informed decisions in previously unexplored states. MoDAP adopts an iterative strategy that simultaneously trains the policy and the dynamics models. The policy optimization seeks to maximize expected returns across the dynamics models, while the dynamics models are trained to promote policy diversification through the proposed information-theoretic objective. We evaluate MoDAP through experiments on the D4RL and NeoRL benchmarks, showcasing its performance superiority over state-of-the-art algorithms.

AAAI Conference 2024 Conference Paper

VSFormer: Visual-Spatial Fusion Transformer for Correspondence Pruning

  • Tangfei Liao
  • Xiaoqin Zhang
  • Li Zhao
  • Tao Wang
  • Guobao Xiao

Correspondence pruning aims to find correct matches (inliers) from an initial set of putative correspondences, which is a fundamental task for many applications. Finding them is challenging, given the varying inlier ratios between scenes/image pairs due to significant visual differences. Moreover, the performance of existing methods is usually limited by a lack of visual cues (e.g., texture, illumination, structure) about the scene. In this paper, we propose a Visual-Spatial Fusion Transformer (VSFormer) to identify inliers and recover camera poses accurately. First, we obtain highly abstract visual cues of a scene with cross attention between local features of the two-view images. Then, we model these visual cues and correspondences with a joint visual-spatial fusion module, simultaneously embedding visual cues into correspondences for pruning. Additionally, to mine the consistency of correspondences, we design a novel module that combines a KNN-based graph and a transformer, effectively capturing both local and global contexts. Extensive experiments demonstrate that the proposed VSFormer outperforms state-of-the-art methods on outdoor and indoor benchmarks. Our code is provided at the following repository: https://github.com/sugar-fly/VSFormer.

AAMAS Conference 2023 Conference Paper

Curriculum Offline Reinforcement Learning

  • Yuanying Cai
  • Chuheng Zhang
  • Hanye Zhao
  • Li Zhao
  • Jiang Bian

Offline reinforcement learning holds the promise of obtaining powerful agents from large datasets. To achieve this, a good algorithm should always benefit from (or at least not be hurt by) adding more samples, even if the samples are not collected by expert policies. However, we observe that many popular offline RL algorithms do not possess such a property and sometimes suffer from adding heterogeneous or poor samples to the dataset. Empirically, we show that, given a stage in the learning process, not all samples are useful for these algorithms. Specifically, the agent can learn more efficiently with only the samples collected by a policy similar to the current policy. This indicates that different samples may contribute to different stages of the training process, and we therefore propose Curriculum Offline Reinforcement Learning (CUORL) to equip previous methods with such a favorable property. In CUORL, we select the samples that are likely to be generated by the current policy to train the agent. Empirically, we show that CUORL can prevent the negative impact of adding samples from poor policies and always improves performance with more samples (even from random policies). Moreover, CUORL also achieves state-of-the-art performance on standard D4RL datasets, which indicates the potential of curriculum learning for offline RL.

NeurIPS Conference 2023 Conference Paper

Distributional Pareto-Optimal Multi-Objective Reinforcement Learning

  • Xin-Qiang Cai
  • Pushi Zhang
  • Li Zhao
  • Jiang Bian
  • Masashi Sugiyama
  • Ashley Llorens

Multi-objective reinforcement learning (MORL) has been proposed to learn control policies over multiple competing objectives with each possible preference over returns. However, current MORL algorithms fail to account for distributional preferences over the multi-variate returns, which are particularly important in real-world scenarios such as autonomous driving. To address this issue, we extend the concept of Pareto-optimality in MORL into distributional Pareto-optimality, which captures the optimality of return distributions, rather than the expectations. Our proposed method, called Distributional Pareto-Optimal Multi-Objective Reinforcement Learning (DPMORL), is capable of learning distributional Pareto-optimal policies that balance multiple objectives while considering the return uncertainty. We evaluated our method on several benchmark problems and demonstrated its effectiveness in discovering distributional Pareto-optimal policies and satisfying diverse distributional preferences compared to existing MORL methods.

AAAI Conference 2023 Conference Paper

H-TSP: Hierarchically Solving the Large-Scale Traveling Salesman Problem

  • Xuanhao Pan
  • Yan Jin
  • Yuandong Ding
  • Mingxiao Feng
  • Li Zhao
  • Lei Song
  • Jiang Bian

We propose an end-to-end learning framework based on hierarchical reinforcement learning, called H-TSP, for addressing the large-scale Traveling Salesman Problem (TSP). The proposed H-TSP constructs a solution to a TSP instance from scratch, relying on two components: the upper-level policy chooses a small subset of nodes (up to 200 in our experiments) from all nodes that are to be traversed, while the lower-level policy takes the chosen nodes as input and outputs a tour connecting them to the existing partial route (initially containing only the depot). After jointly training the upper-level and lower-level policies, our approach can directly generate solutions for given TSP instances without relying on any time-consuming search procedures. To demonstrate the effectiveness of the proposed approach, we have conducted extensive experiments on randomly generated TSP instances with different numbers of nodes. We show that H-TSP can achieve results comparable to SOTA search-based approaches (gap 3.42% vs. 7.32%), and more importantly, reduce the time consumption by up to two orders of magnitude (3.32s vs. 395.85s). To the best of our knowledge, H-TSP is the first end-to-end deep reinforcement learning approach that can scale to TSP instances of up to 10000 nodes. Although there are still gaps to SOTA results with respect to solution quality, we believe H-TSP will be useful for practical applications, particularly time-sensitive ones such as on-call routing and ride-hailing services.
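
The hierarchical decomposition can be sketched with simple heuristics standing in for the two learned policies (everything below is an illustrative stand-in, not H-TSP's networks):

```python
# "Upper level": repeatedly pick up to 200 unvisited nodes near the open end
# of the route. "Lower level": order them and splice them in; here a greedy
# nearest-neighbor sub-solver replaces the learned policy.
import numpy as np

def lower_level(start, subset, coords):
    tour, cur, remaining = [], start, set(subset)
    while remaining:
        nxt = min(remaining, key=lambda j: np.linalg.norm(coords[cur] - coords[j]))
        tour.append(nxt); remaining.discard(nxt); cur = nxt
    return tour

rng = np.random.default_rng(0)
coords = rng.random((1000, 2))
route, unvisited = [0], set(range(1, 1000))
while unvisited:
    end = route[-1]
    subset = sorted(unvisited,
                    key=lambda j: np.linalg.norm(coords[end] - coords[j]))[:200]
    segment = lower_level(end, subset, coords)
    route.extend(segment); unvisited.difference_update(segment)
print("route length:", len(route))  # all 1000 nodes visited once
```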

AAAI Conference 2023 Conference Paper

Pointerformer: Deep Reinforced Multi-Pointer Transformer for the Traveling Salesman Problem

  • Yan Jin
  • Yuandong Ding
  • Xuanhao Pan
  • Kun He
  • Li Zhao
  • Tao Qin
  • Lei Song
  • Jiang Bian

Traveling Salesman Problem (TSP), a classic routing optimization problem originally arising in transportation and logistics, has become a critical task in broader domains, such as manufacturing and biology. Recently, Deep Reinforcement Learning (DRL) has been increasingly employed to solve TSP due to its high inference efficiency. Nevertheless, most existing end-to-end DRL algorithms only perform well on small TSP instances and can hardly generalize to large scale, as memory consumption and computation time soar drastically with problem size. In this paper, we propose a novel end-to-end DRL approach, referred to as Pointerformer, based on a multi-pointer Transformer. In particular, Pointerformer adopts both a reversible residual network in the encoder and a multi-pointer network in the decoder to effectively contain the memory consumption of the encoder-decoder architecture. To further improve the quality of TSP solutions, Pointerformer employs a feature augmentation method to explore the symmetries of TSP at both training and inference stages, as well as an enhanced context embedding approach that includes more comprehensive context information in the query. Extensive experiments on a randomly generated benchmark and a public benchmark show that, while achieving results comparable to state-of-the-art DRL approaches on most small-scale TSP instances, Pointerformer also generalizes well to large-scale TSPs.
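
A skeletal multi-pointer decoding head, as I read it from the abstract (the head count, logit pooling, and scaling are my guesses, not the released model):

```python
# Several pointer heads score candidate cities; their logits are pooled
# before the visit mask and softmax over the next city.
import torch
import torch.nn as nn

class MultiPointer(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.q = nn.Linear(dim, dim * heads)
        self.heads, self.dim = heads, dim

    def forward(self, query, node_emb, visited_mask):
        # query: (B, D); node_emb: (B, N, D); visited_mask: (B, N) bool
        q = self.q(query).view(-1, self.heads, self.dim)      # (B, H, D)
        logits = torch.einsum("bhd,bnd->bhn", q, node_emb)    # (B, H, N)
        logits = logits.mean(dim=1) / self.dim ** 0.5         # pool heads
        logits = logits.masked_fill(visited_mask, float("-inf"))
        return torch.log_softmax(logits, dim=-1)              # next-city log-probs

ptr = MultiPointer(dim=128)
logp = ptr(torch.rand(4, 128), torch.rand(4, 50, 128),
           torch.zeros(4, 50, dtype=torch.bool))
```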

IJCAI Conference 2023 Conference Paper

Towards Generalizable Reinforcement Learning for Trade Execution

  • Chuheng Zhang
  • Yitong Duan
  • Xiaoyu Chen
  • Jianyu Chen
  • Jian Li
  • Li Zhao

Optimized trade execution aims to sell (or buy) a given amount of assets within a given time at the lowest possible trading cost. Recently, reinforcement learning (RL) has been applied to optimized trade execution to learn smarter policies from market data. However, we find that many existing RL methods exhibit considerable overfitting, which prevents them from real deployment. In this paper, we provide an extensive study of the overfitting problem in optimized trade execution. First, we model optimized trade execution as offline RL with dynamic context (ORDC), where the context represents market variables that cannot be influenced by the trading policy and are collected in an offline manner. Under this framework, we derive the generalization bound and find that the overfitting issue is caused by the large context space and limited context samples in the offline setting. Accordingly, we propose to learn compact representations of the context to address the overfitting problem, either by leveraging prior knowledge or in an end-to-end manner. To evaluate our algorithms, we also implement a carefully designed simulator based on historical limit order book (LOB) data to provide a high-fidelity benchmark for different algorithms. Our experiments on the high-fidelity simulator demonstrate that our algorithms can effectively alleviate overfitting and achieve better performance.

NeurIPS Conference 2022 Conference Paper

An Adaptive Deep RL Method for Non-Stationary Environments with Piecewise Stable Context

  • Xiaoyu Chen
  • Xiangming Zhu
  • Yufeng Zheng
  • Pushi Zhang
  • Li Zhao
  • Wenxue Cheng
  • Peng Cheng
  • Yongqiang Xiong

One of the key challenges in deploying RL to real-world applications is adapting to variations of unknown environment contexts, such as changing terrains in robotic tasks and fluctuating bandwidth in congestion control. Existing works on adaptation to unknown environment contexts either assume the context is the same for the whole episode or assume the context variables are Markovian. However, in many real-world applications, the environment context usually stays stable for a stochastic period and then changes in an abrupt and unpredictable manner within an episode, resulting in a segment structure that existing works fail to address. To leverage the segment structure of piecewise stable context in real-world applications, in this paper, we propose a Segmented Context Belief Augmented Deep (SeCBAD) RL method. Our method can jointly infer the belief distribution over the latent context with the posterior over segment length, and perform more accurate belief context inference with observed data within the current context segment. The inferred belief context can be leveraged to augment the state, leading to a policy that can adapt to abrupt variations in context. We demonstrate empirically that SeCBAD can infer context segment length accurately and outperform existing methods on a toy grid world environment and MuJoCo tasks with piecewise-stable context.

NeurIPS Conference 2022 Conference Paper

Tiered Reinforcement Learning: Pessimism in the Face of Uncertainty and Constant Regret

  • Jiawei Huang
  • Li Zhao
  • Tao Qin
  • Wei Chen
  • Nan Jiang
  • Tie-Yan Liu

We propose a new learning framework that captures the tiered structure of many real-world user-interaction applications, where the users can be divided into two groups based on their different tolerance for exploration risks and should be treated separately. In this setting, we simultaneously maintain two policies $\pi^{\text{O}}$ and $\pi^{\text{E}}$: $\pi^{\text{O}}$ ("O" for "online") interacts with more risk-tolerant users from the first tier and minimizes regret by balancing exploration and exploitation as usual, while $\pi^{\text{E}}$ ("E" for "exploit") exclusively focuses on exploitation for risk-averse users from the second tier, utilizing the data collected so far. An important question is whether such a separation yields advantages over the standard online setting (i.e., $\pi^{\text{E}}=\pi^{\text{O}}$) for the risk-averse users. We consider the gap-independent and gap-dependent settings individually. For the former, we prove that the separation is indeed not beneficial from a minimax perspective. For the latter, we show that if we choose Pessimistic Value Iteration as the exploitation algorithm to produce $\pi^{\text{E}}$, we can achieve constant regret for risk-averse users independent of the number of episodes $K$, which is in sharp contrast to the $\Omega(\log K)$ regret of any online RL algorithm in the same setting, while the regret of $\pi^{\text{O}}$ (almost) maintains its online regret optimality and does not need to compromise for the success of $\pi^{\text{E}}$.
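
The exploitation side builds on Pessimistic Value Iteration; a tabular toy version (all quantities synthetic, and the bonus form is a standard assumption, not taken from the paper) shows the mechanism of subtracting a count-based uncertainty penalty:

```python
# Pessimistic value iteration on a random toy MDP: backups penalize rarely
# visited state-action pairs, so pi^E only commits where the data supports it.
import numpy as np

S, A, gamma, beta = 5, 2, 0.9, 1.0
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(S, A))   # empirical transitions (S, A, S)
R = rng.random((S, A))                       # empirical rewards
N = rng.integers(1, 50, size=(S, A))         # visit counts from logged data

V = np.zeros(S)
for _ in range(200):
    bonus = beta / np.sqrt(N)                # pessimism: penalize rare pairs
    Q = R - bonus + gamma * P @ V
    V = np.clip(Q.max(axis=1), 0.0, None)    # keep values nonnegative
pi_exploit = Q.argmax(axis=1)
```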

NeurIPS Conference 2021 Conference Paper

Curriculum Offline Imitating Learning

  • Minghuan Liu
  • Hanye Zhao
  • Zhengyu Yang
  • Jian Shen
  • Weinan Zhang
  • Li Zhao
  • Tie-Yan Liu

Offline reinforcement learning (RL) tasks require the agent to learn from a pre-collected dataset with no further interaction with the environment. Despite the potential to surpass the behavioral policies, RL-based methods are generally impractical due to training instability and the bootstrapping of extrapolation errors, which always require careful hyperparameter tuning via online evaluation. In contrast, offline imitation learning (IL) has no such issues, since it learns the policy directly without estimating the value function by bootstrapping. However, IL is usually limited by the capability of the behavioral policy and tends to learn a mediocre behavior from a dataset collected by a mixture of policies. In this paper, we aim to take advantage of IL while mitigating this drawback. Observing that behavior cloning is able to imitate neighboring policies with less data, we propose Curriculum Offline Imitation Learning (COIL), which utilizes an experience-picking strategy to make the agent imitate adaptive neighboring policies with higher returns, and improves the current policy along curriculum stages. On continuous control benchmarks, we compare COIL against both imitation-based and RL-based methods, showing that COIL not only avoids learning a mediocre behavior on mixed datasets but is even competitive with state-of-the-art offline RL methods.
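
A simplified rendering of the experience-picking rule (the scoring and thresholding below are my own simplification of COIL, not the paper's exact criterion):

```python
# Pick trajectories that (a) beat the current policy's return and (b) the
# current policy already assigns high likelihood to, then behavior-clone.
def pick_stage(dataset, logp_fn, current_return, k=16):
    candidates = [t for t in dataset if t["return"] > current_return]
    candidates.sort(key=lambda t: -logp_fn(t["states"], t["actions"]))
    return candidates[:k]

# Toy demo: "log-likelihood" prefers trajectories near index 40.
data = [{"states": i, "actions": i, "return": float(i)} for i in range(100)]
batch = pick_stage(data, lambda s, a: -abs(s - 40), current_return=30.0)

# Training loop shape: behavior-clone on the picked batch, re-estimate the
# policy's return, and move to the next curriculum stage.
# for stage in range(n_stages):
#     batch = pick_stage(offline_data, policy.log_prob, eval_return(policy))
#     behavior_clone(policy, batch)
```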

NeurIPS Conference 2021 Conference Paper

Distributional Reinforcement Learning for Multi-Dimensional Reward Functions

  • Pushi Zhang
  • Xiaoyu Chen
  • Li Zhao
  • Wei Xiong
  • Tao Qin
  • Tie-Yan Liu

A growing trend for value-based reinforcement learning (RL) algorithms is to capture more information than scalar value functions in the value network. One of the most well-known methods in this branch is distributional RL, which models the return distribution instead of the scalar value. In another line of work, hybrid reward architectures (HRA) in RL have been studied to model source-specific value functions for each source of reward, which has also been shown to be beneficial for performance. To fully inherit the benefits of distributional RL and hybrid reward architectures, we introduce Multi-Dimensional Distributional DQN (MD3QN), which extends distributional RL to model the joint return distribution from multiple reward sources. As a by-product of joint distribution modeling, MD3QN can capture not only the randomness in returns for each source of reward, but also the rich reward correlations between the randomness of different sources. We prove convergence for the joint distributional Bellman operator and build our empirical algorithm by minimizing the Maximum Mean Discrepancy between the joint return distribution and its Bellman target. In experiments, our method accurately models the joint return distribution in environments with richly correlated reward functions, and outperforms previous RL methods utilizing multi-dimensional reward functions in the control setting.
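
The training signal is a Maximum Mean Discrepancy between sampled joint returns and their Bellman targets; a compact Gaussian-kernel MMD in PyTorch (sample shapes and bandwidths are assumptions of mine) conveys the loss:

```python
# Multi-bandwidth Gaussian-kernel MMD between two sample sets, usable as a
# differentiable loss between predicted and target joint return samples.
import torch

def mmd(x, y, bandwidths=(1.0, 4.0, 16.0)):
    # x, y: (n, d) samples from the predicted / target joint return dists.
    def k(a, b):
        d2 = torch.cdist(a, b).pow(2)
        return sum(torch.exp(-d2 / (2 * h)) for h in bandwidths)
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

pred = torch.randn(64, 3, requires_grad=True)   # 3 reward sources
target = torch.randn(64, 3) + 0.5
loss = mmd(pred, target)
loss.backward()
```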

IJCAI Conference 2021 Conference Paper

Independence-aware Advantage Estimation

  • Pushi Zhang
  • Li Zhao
  • Guoqing Liu
  • Jiang Bian
  • Minlie Huang
  • Tao Qin
  • Tie-Yan Liu

Most existing advantage function estimation methods in reinforcement learning suffer from high variance, which scales unfavorably with the time horizon. To address this challenge, we propose to identify the independence property between the current action and future states in environments, which can be further leveraged to effectively reduce the variance of the advantage estimation. In particular, the recognized independence property can be naturally utilized to construct a novel importance sampling advantage estimator with close-to-zero variance, even when the Monte-Carlo return signal yields a large variance. To further remove the risk of the high variance introduced by the new estimator, we combine it with the existing Monte-Carlo estimator via a reward decomposition model learned by minimizing the estimation variance. Experiments demonstrate that our method achieves higher sample efficiency than existing advantage estimation methods in complex environments.

NeurIPS Conference 2021 Conference Paper

Object-Aware Regularization for Addressing Causal Confusion in Imitation Learning

  • Jongjin Park
  • Younggyo Seo
  • Chang Liu
  • Li Zhao
  • Tao Qin
  • Jinwoo Shin
  • Tie-Yan Liu

Behavioral cloning has proven to be effective for learning sequential decision-making policies from expert demonstrations. However, behavioral cloning often suffers from the causal confusion problem, where a policy relies on the noticeable effect of expert actions due to the strong correlation but not the cause we desire. This paper presents Object-aware REgularizatiOn (OREO), a simple technique that regularizes an imitation policy in an object-aware manner. Our main idea is to encourage a policy to uniformly attend to all semantic objects, in order to prevent the policy from exploiting nuisance variables strongly correlated with expert actions. To this end, we introduce a two-stage approach: (a) we extract semantic objects from images by utilizing discrete codes from a vector-quantized variational autoencoder, and (b) we randomly drop the units that share the same discrete code together, i.e., masking out semantic objects. Our experiments demonstrate that OREO significantly improves the performance of behavioral cloning, outperforming various other regularization and causality-based methods on a variety of Atari environments and a self-driving CARLA environment. We also show that our method even outperforms inverse reinforcement learning methods trained with a considerable amount of environment interaction.

NeurIPS Conference 2020 Conference Paper

RD$^2$: Reward Decomposition with Representation Decomposition

  • Zichuan Lin
  • Derek Yang
  • Li Zhao
  • Tao Qin
  • Guangwen Yang
  • Tie-Yan Liu

Reward decomposition, which aims to decompose the full reward into multiple sub-rewards, has been proven beneficial for improving sample efficiency in reinforcement learning. Existing works on discovering reward decompositions are mostly policy dependent, which constrains diverse or disentangled behavior between different policies induced by different sub-rewards. In this work, we propose a set of novel reward decomposition principles by constraining the uniqueness and compactness of the different state features/representations relevant to different sub-rewards. Our principles encourage sub-rewards with minimal relevant features, while maintaining the uniqueness of each sub-reward. We derive a deep learning algorithm based on our principles, and term our method RD$^2$, since we learn reward decomposition and representation decomposition jointly. RD$^2$ is evaluated on a toy case where the true reward structure is known, and on some Atari environments where a reward structure exists but is unknown to the agent, demonstrating the effectiveness of RD$^2$ against existing reward decomposition methods.

NeurIPS Conference 2019 Conference Paper

Distributional Reward Decomposition for Reinforcement Learning

  • Zichuan Lin
  • Li Zhao
  • Derek Yang
  • Tao Qin
  • Tie-Yan Liu
  • Guangwen Yang

Many reinforcement learning (RL) tasks have specific properties that can be leveraged to modify existing RL algorithms to adapt to those tasks and further improve performance, and a general class of such properties is the multiple reward channel. In those environments the full reward can be decomposed into sub-rewards obtained from different channels. Existing work on reward decomposition either requires prior knowledge of the environment to decompose the full reward, or decomposes reward without prior knowledge but with degraded performance. In this paper, we propose Distributional Reward Decomposition for Reinforcement Learning (DRDRL), a novel reward decomposition algorithm which captures the multiple reward channel structure under distributional setting. Empirically, our method captures the multi-channel structure and discovers meaningful reward decomposition, without any requirements on prior knowledge. Consequently, our agent achieves better performance than existing methods on environments with multiple reward channels.

NeurIPS Conference 2019 Conference Paper

Fully Parameterized Quantile Function for Distributional Reinforcement Learning

  • Derek Yang
  • Li Zhao
  • Zichuan Lin
  • Tao Qin
  • Jiang Bian
  • Tie-Yan Liu

Distributional Reinforcement Learning (RL) differs from traditional RL in that, rather than the expectation of total returns, it estimates distributions, and it has achieved state-of-the-art performance on Atari games. The key challenge in practical distributional RL algorithms lies in how to parameterize estimated distributions so as to better approximate the true continuous distribution. Existing distributional RL algorithms parameterize either the probability side or the return value side of the distribution function, leaving the other side uniformly fixed as in C51 and QR-DQN, or randomly sampled as in IQN. In this paper, we propose a fully parameterized quantile function that parameterizes both the quantile fraction axis (i.e., the x-axis) and the value axis (i.e., the y-axis) for distributional RL. Our algorithm contains a fraction proposal network that generates a discrete set of quantile fractions and a quantile value network that gives the corresponding quantile values. The two networks are jointly trained to find the best approximation of the true distribution. Experiments on 55 Atari games show that our algorithm significantly outperforms existing distributional RL algorithms and creates a new record for the Atari Learning Environment for non-distributed agents.
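
The two networks can be sketched as follows (the dimensions, cosine embedding, and wiring are my guesses in the IQN style, not the reference implementation):

```python
# Fraction proposal net: monotone quantile fractions via a cumulative softmax.
# Quantile value net: maps (state embedding, fraction) to a quantile value.
import torch
import torch.nn as nn

class FractionProposal(nn.Module):
    def __init__(self, emb_dim, n_quantiles=32):
        super().__init__()
        self.fc = nn.Linear(emb_dim, n_quantiles)

    def forward(self, state_emb):
        probs = torch.softmax(self.fc(state_emb), dim=-1)
        taus = torch.cumsum(probs, dim=-1)                    # monotone in (0, 1]
        return torch.cat([torch.zeros_like(taus[:, :1]), taus], dim=-1)

class QuantileValue(nn.Module):
    def __init__(self, emb_dim, n_cos=64):
        super().__init__()
        self.cos_emb = nn.Linear(n_cos, emb_dim)
        self.out = nn.Linear(emb_dim, 1)
        self.register_buffer("freqs", torch.arange(1, n_cos + 1).float() * torch.pi)

    def forward(self, state_emb, taus):
        cos = torch.cos(taus.unsqueeze(-1) * self.freqs)            # (B, T, n_cos)
        phi = torch.relu(self.cos_emb(cos))                         # (B, T, emb)
        return self.out(phi * state_emb.unsqueeze(1)).squeeze(-1)   # (B, T)

emb = torch.rand(8, 128)
taus = FractionProposal(128)(emb)
values = QuantileValue(128)(emb, taus[:, 1:])   # value at each proposed tau
```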

AAAI Conference 2019 Conference Paper

Trust Region Evolution Strategies

  • Guoqing Liu
  • Li Zhao
  • Feidiao Yang
  • Jiang Bian
  • Tao Qin
  • Nenghai Yu
  • Tie-Yan Liu

Evolution Strategies (ES), a class of black-box optimization algorithms, has recently been demonstrated to be a viable alternative to popular MDP-based RL techniques such as Q-learning and Policy Gradients. ES achieves fairly good performance on challenging reinforcement learning problems and is easier to scale in a distributed setting. However, standard ES algorithms perform one gradient update per data sample, which is not very efficient. In this paper, with the aim of using sampled data more efficiently, we propose a novel iterative procedure that optimizes a surrogate objective function, enabling data samples to be reused for multiple epochs of updates. We prove a monotonic improvement guarantee for this procedure. By making several approximations to the theoretically-justified procedure, we further develop a practical algorithm called Trust Region Evolution Strategies (TRES). Our experiments demonstrate the effectiveness of TRES on a range of popular MuJoCo locomotion tasks in the OpenAI Gym, achieving better performance than the ES algorithm.
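
The sample-reuse idea admits a small numpy sketch: one batch of Gaussian perturbations is reused for several surrogate updates, each importance-weighted by the density ratio between the updated and the original sampling distribution (the trust-region machinery and the guarantees are in the paper, not here):

```python
# Importance-weighted reuse of ES perturbations on a toy quadratic objective.
import numpy as np

def f(theta):                                    # black-box objective
    return -np.sum((theta - 1.0) ** 2)

rng = np.random.default_rng(0)
theta, sigma, lr = np.zeros(10), 0.1, 0.05
eps = rng.normal(size=(64, 10))                  # sampled once...
returns = np.array([f(theta + sigma * e) for e in eps])

mu0 = theta.copy()
for _ in range(5):                               # ...reused for 5 updates
    # density ratio N(eps; (theta - mu0)/sigma, I) / N(eps; 0, I)
    shift = (theta - mu0) / sigma
    logw = eps @ shift - 0.5 * np.sum(shift ** 2)
    w = np.exp(logw - logw.max()); w /= w.sum()
    grad = (w * (returns - returns.mean())) @ eps / sigma
    theta = theta + lr * grad
print(f(theta))                                  # closer to 0 than f(zeros)
```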

AAAI Conference 2018 Conference Paper

Dual Transfer Learning for Neural Machine Translation with Marginal Distribution Regularization

  • Yijun Wang
  • Yingce Xia
  • Li Zhao
  • Jiang Bian
  • Tao Qin
  • Guiquan Liu
  • Tie-Yan Liu

Neural machine translation (NMT) heavily relies on parallel bilingual data for training. Since large-scale, high-quality parallel corpora are usually costly to collect, it is appealing to exploit monolingual corpora to improve NMT. Inspired by the law of total probability, which connects the probability of a given target-side monolingual sentence to the conditional probability of translating from a source sentence to the target one, we propose to explicitly exploit this connection to learn from and regularize the training of NMT models using monolingual data. The key technical challenge of this approach is that there are exponentially many source sentences for a target monolingual sentence when computing the sum of the conditional probabilities given each possible source sentence. We address this challenge by leveraging the dual translation model (target-to-source translation) to sample several most likely source-side sentences and avoid enumerating all possible candidate source sentences. That is, we transfer the knowledge contained in the dual model to boost the training of the primal model (source-to-target translation), and we call such an approach dual transfer learning. Experimental results on English→French and German→English tasks demonstrate that dual transfer learning achieves significant improvement over several strong baselines and obtains new state-of-the-art results.
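
The marginal-probability estimate can be illustrated numerically (all distributions below are made-up stand-ins for the actual models): sample likely source sentences x from the dual model q(x|y) and importance-weight to approximate P(y) = Σ_x P(x) P(y|x).

```python
# Importance-weighted estimate of log P(y) using samples from a dual model.
import numpy as np

rng = np.random.default_rng(0)
n_src = 1000
log_p_x = np.log(rng.dirichlet(np.ones(n_src)))           # source LM (toy)
log_p_y_given_x = np.log(rng.uniform(0, 1e-3, n_src))     # primal model (toy)
q_x_given_y = rng.dirichlet(np.ones(n_src))               # dual model (toy)

idx = rng.choice(n_src, size=8, p=q_x_given_y)            # likely sources
logw = log_p_x[idx] + log_p_y_given_x[idx] - np.log(q_x_given_y[idx])
log_marginal = np.log(np.mean(np.exp(logw - logw.max()))) + logw.max()
# Training would penalize the gap between this estimate and a language
# model's log P(y), regularizing the primal translation model.
print(log_marginal)
```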

AAAI Conference 2018 Conference Paper

Learning Structured Representation for Text Classification via Reinforcement Learning

  • Tianyang Zhang
  • Minlie Huang
  • Li Zhao

Representation learning is a fundamental problem in natural language processing. This paper studies how to learn a structured representation for text classification. Unlike most existing representation models that either use no structure or rely on pre-specified structures, we propose a reinforcement learning (RL) method to learn sentence representations by discovering optimized structures automatically. We demonstrate two attempts to build structured representations: Information Distilled LSTM (ID-LSTM) and Hierarchically Structured LSTM (HS-LSTM). ID-LSTM selects only important, task-relevant words, and HS-LSTM discovers phrase structures in a sentence. Structure discovery in the two representation models is formulated as a sequential decision problem: the current decision of structure discovery affects subsequent decisions, which can be addressed by policy gradient RL. Results show that our method can learn task-friendly representations by identifying important words or task-relevant structures without explicit structure annotations, and thus yields competitive performance.

AAAI Conference 2018 Conference Paper

Reinforcement Learning for Relation Classification From Noisy Data

  • Jun Feng
  • Minlie Huang
  • Li Zhao
  • Yang Yang
  • Xiaoyan Zhu

Existing relation classification methods that rely on distant supervision assume that a bag of sentences mentioning an entity pair all describe a relation for the entity pair. Such methods, performing classification at the bag level, cannot identify the mapping between a relation and a sentence, and largely suffer from the noisy labeling problem. In this paper, we propose a novel model for relation classification at the sentence level from noisy data. The model has two modules: an instance selector and a relation classifier. The instance selector chooses high-quality sentences with reinforcement learning and feeds the selected sentences into the relation classifier, and the relation classifier makes sentence-level predictions and provides rewards to the instance selector. The two modules are trained jointly to optimize the instance selection and relation classification processes. Experiment results show that our model can deal with the noise of data effectively and obtains better performance for relation classification at the sentence level.

AAAI Conference 2018 Conference Paper

Word Attention for Sequence to Sequence Text Understanding

  • Lijun Wu
  • Fei Tian
  • Li Zhao
  • Jianhuang Lai
  • Tie-Yan Liu

Attention mechanism has been a key component in Recurrent Neural Network (RNN) based sequence to sequence learning frameworks, which have been adopted in many text understanding tasks, such as neural machine translation and abstractive summarization. In these tasks, the attention mechanism models how important each part of the source sentence is for generating a target-side word. To compute such importance scores, the attention mechanism summarizes the source-side information in the encoder RNN hidden states (i.e., h_t), and then builds a context vector for a target-side word upon a subsequence representation of the source sentence, since h_t actually summarizes the information of the subsequence containing the first t words of the source sentence. In this paper, we show that an additional attention mechanism, called word attention, that builds itself upon word-level representations significantly enhances the performance of sequence to sequence learning. Our word attention can enrich the source-side contextual representation by directly promoting clean word-level information at each step. Furthermore, we propose to use contextual gates to dynamically combine the subsequence-level and word-level contextual information. Experimental results on abstractive summarization and neural machine translation show that word attention significantly improves over strong baselines. In particular, we achieve the state-of-the-art result on the WMT’14 English-French translation task with 12M training data.
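
One plausible wiring of word attention with a contextual gate (my sketch, not the paper's code): one attention over encoder states, one directly over word embeddings, and a learned gate mixing the two context vectors.

```python
# Dot-product attention over RNN states and over raw word embeddings, with a
# sigmoid gate combining the two resulting context vectors.
import torch
import torch.nn as nn

class GatedWordAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def attend(self, query, keys):
        # query: (B, D); keys: (B, T, D) -> context (B, D)
        scores = torch.softmax(torch.einsum("bd,btd->bt", query, keys), -1)
        return torch.einsum("bt,btd->bd", scores, keys)

    def forward(self, dec_state, enc_states, word_embs):
        c_seq = self.attend(dec_state, enc_states)     # subsequence context
        c_word = self.attend(dec_state, word_embs)     # word-level context
        g = torch.sigmoid(self.gate(torch.cat([c_seq, c_word], -1)))
        return g * c_seq + (1 - g) * c_word            # gated combination

attn = GatedWordAttention(256)
ctx = attn(torch.rand(4, 256), torch.rand(4, 20, 256), torch.rand(4, 20, 256))
```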

IJCAI Conference 2017 Conference Paper

Sequence Prediction with Unlabeled Data by Reward Function Learning

  • Lijun Wu
  • Li Zhao
  • Tao Qin
  • Jianhuang Lai
  • Tie-Yan Liu

Reinforcement learning (RL), which has been successfully applied to sequence prediction, introduces reward as a sequence-level supervision signal to evaluate the quality of a generated sequence. Existing RL approaches use the ground-truth sequence to define the reward, which limits the application of RL techniques to labeled data. Since labeled data is usually scarce and/or costly to collect, it is desirable to leverage large-scale unlabeled data. In this paper, we extend existing RL methods for sequence prediction to exploit unlabeled data. We propose to learn the reward function from labeled data and use the predicted reward as a pseudo reward for unlabeled data, so that we can learn from unlabeled data using the pseudo reward. To obtain a good pseudo reward on unlabeled data, we propose an RNN-based reward network with an attention mechanism, trained with a purposely biased data distribution. Experiments show that the pseudo reward can provide good supervision and guide the learning process on unlabeled data. We observe significant improvements on both neural machine translation and text summarization.

AAAI Conference 2016 Conference Paper

Semi-Supervised Multinomial Naive Bayes for Text Classification by Leveraging Word-Level Statistical Constraint

  • Li Zhao
  • Minlie Huang
  • Ziyu Yao
  • Rongwei Su
  • Yingying Jiang
  • Xiaoyan Zhu

Multinomial Naive Bayes with Expectation Maximization (MNB-EM) is a standard semi-supervised learning method to augment Multinomial Naive Bayes (MNB) for text classification. Despite its success, MNB-EM is not stable, and may succeed or fail to improve MNB. We believe this is because MNB-EM lacks the ability to preserve the class distribution on words. In this paper, we propose a novel method to augment MNB-EM by leveraging a word-level statistical constraint to preserve the class distribution on words. The word-level statistical constraints are further converted into constraints on document posteriors generated by MNB-EM. Experiments demonstrate that our method can consistently improve MNB-EM, and outperforms state-of-the-art baselines remarkably.
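
For reference, a plain semi-supervised MNB-EM loop in numpy; the paper's word-level constraint would be enforced on the E-step posteriors, a projection omitted here.

```python
# EM for semi-supervised Multinomial Naive Bayes: labeled posteriors stay
# fixed, unlabeled posteriors are re-estimated each iteration.
import numpy as np

def mnb_em(X_lab, y_lab, X_unlab, n_classes, n_iter=10, alpha=1.0):
    X = np.vstack([X_lab, X_unlab])                    # doc-term counts
    post = np.zeros((len(X), n_classes))
    post[np.arange(len(X_lab)), y_lab] = 1.0           # labeled: fixed
    post[len(X_lab):] = 1.0 / n_classes                # unlabeled: uniform
    for _ in range(n_iter):
        # M-step: class priors and word distributions from soft counts.
        prior = post.sum(0) / post.sum()
        word = (post.T @ X) + alpha                    # Laplace smoothing
        word /= word.sum(1, keepdims=True)
        # E-step: recompute posteriors for the unlabeled documents only.
        logp = np.log(prior) + X_unlab @ np.log(word.T)
        logp -= logp.max(1, keepdims=True)
        post[len(X_lab):] = np.exp(logp) / np.exp(logp).sum(1, keepdims=True)
    return prior, word

rng = np.random.default_rng(0)
prior, word = mnb_em(rng.integers(0, 5, (20, 50)), rng.integers(0, 2, 20),
                     rng.integers(0, 5, (80, 50)), n_classes=2)
```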