Arrow Research search

Author name cluster

Zihan Zhang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

29 papers
2 author rows

Possible papers

29

TIST Journal 2026 Journal Article

BH3-MedRec: Bilateral Hierarchical Heterogeneous Hypergraph Convolution Network for Medication Recommendation

  • Zihan Zhang
  • Hongzhi Liu
  • Tianqi Sun
  • Xiaoshuang Guo
  • Zhonghai Wu

The development of artificial intelligence and medical informatics has empowered medication recommendation systems with enhanced capabilities. However, existing methods struggle with the data imbalance problem in Electronic Health Records (EHRs), where the majority of records are concentrated on a limited subset of common diagnoses, procedures, and medications. This imbalance hampers the models’ ability to recommend appropriate medications when dealing with uncommon or multifaceted cases. In addition, existing approaches often fail to adequately model the complex relationships inherent in heterogeneous medical data sources, especially medication molecular structure information. This gap restricts the potential for uncovering meaningful associations among diverse clinical entities. To address these issues, we design a hierarchical attention-based pretraining strategy, leveraging the semantic hierarchies of medical entity codes to facilitate knowledge transfer, so as to alleviate the challenge of data imbalance. Furthermore, we design a novel bilateral hierarchical heterogeneous hypergraph convolution network for medication recommendation. Specifically, we construct specialized hypergraphs for both EHR data and medication molecular structure data, enabling hypergraph convolution to capture high-order relationships while promoting bilateral knowledge enhancement between these heterogeneous data sources. This comprehensive integration allows the model to effectively capture the relationships among clinical and molecular information. Experimental results on different hospital departments of MIMIC-III and MIMIC-IV datasets demonstrate the superior performance of our model compared to state-of-the-art methods. Our source code is released at: https://github.com/LusiaZ/BH3-MedRec.
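The hypergraph convolution the abstract relies on can be illustrated with the standard HGNN-style propagation rule; this is a generic sketch of that operation, not the paper's BH3-MedRec implementation, and the function name and arguments are illustrative assumptions:

```python
import numpy as np

def hypergraph_conv(X, H, theta, edge_w=None):
    """One HGNN-style hypergraph convolution layer:
        X' = ReLU(Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2} X Theta)
    X: (n, d_in) node features; H: (n, m) node-hyperedge incidence matrix;
    theta: (d_in, d_out) learnable weights; edge_w: (m,) hyperedge weights."""
    n, m = H.shape
    w = np.ones(m) if edge_w is None else np.asarray(edge_w, dtype=float)
    de = H.sum(axis=0)                                # hyperedge degrees
    dv = H @ w                                        # weighted node degrees
    De_inv = np.diag(1.0 / np.maximum(de, 1e-12))     # avoid division by zero
    Dv_is = np.diag(1.0 / np.sqrt(np.maximum(dv, 1e-12)))
    A = Dv_is @ H @ np.diag(w) @ De_inv @ H.T @ Dv_is # normalized propagation
    return np.maximum(A @ X @ theta, 0.0)             # ReLU activation
```

One such layer per data source (e.g., one hypergraph built from EHR co-occurrences, one from molecular structure, as the abstract describes) lets each hyperedge connect an arbitrary set of nodes, which is what captures the "high-order" relationships beyond pairwise edges.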

AAAI Conference 2026 Conference Paper

Filter, Correlate, Compress: Training-Free Token Reduction for MLLM Acceleration

  • Yuhang Han
  • Xuyang Liu
  • Zihan Zhang
  • Pengxiang Ding
  • Junjie Chen
  • Honggang Chen
  • Donglin Wang
  • Qingsen Yan

The quadratic complexity of Multimodal Large Language Models (MLLMs) with respect to context length poses significant computational and memory challenges, hindering their real-world deployment. In this paper, we devise a "filter-correlate-compress" framework that accelerates MLLMs by systematically reducing the multimodal context length during prefilling. The framework first implements FiCoCo-V, a training-free method operating within the vision encoder. It employs a redundancy-based token-discard mechanism that uses a novel integrated metric to accurately filter out redundant visual tokens. To mitigate information loss, the framework introduces a correlation-based information-recycling mechanism that allows preserved tokens to selectively recycle information from correlated discarded tokens via self-preserving compression, thereby preventing the dilution of their own core content. The framework's FiCoCo-L variant further leverages task-aware textual priors to perform token reduction directly within the LLM decoder. Extensive experiments demonstrate that the FiCoCo series effectively accelerates a range of MLLMs, achieving up to a 14.7× FLOPs reduction with 93.6% performance retention. Our methods consistently outperform state-of-the-art training-free approaches, showcasing effectiveness and generalizability across model architectures, sizes, and tasks without requiring retraining.

AAAI Conference 2026 Conference Paper

GraphRAG-Induced Dual Knowledge Structure Graphs for Personalized Learning Path Recommendation

  • Xinghe Cheng
  • Zihan Zhang
  • Jiapu Wang
  • Liangda Fang
  • Chaobo He
  • Quanlong Guan
  • Shirui Pan
  • Weiqi Luo

Learning path recommendation seeks to provide students with a structured sequence of learning items (e.g., knowledge concepts or exercises) to optimize their learning efficiency. Despite significant efforts in this area, most existing methods primarily rely on prerequisite relations, which present two major limitations: (1) Prerequisite relations between knowledge concepts are difficult to obtain due to the cost of expert annotation, hindering the application of current learning path recommendation methods. (2) Relying on a single sequentially dependent knowledge structure based on prerequisite relations implies that a confusing knowledge concept can disrupt subsequent learning processes, which is referred to as blocked learning. To address these two challenges, we propose a novel approach, GraphRAG-Induced Dual Knowledge Structure Graphs for Personalized Learning Path Recommendation (KnowLP), which enhances learning path recommendations by incorporating both prerequisite and similarity relations between knowledge concepts. Specifically, we introduce a knowledge structure graph generation module EDU-GraphRAG that constructs knowledge structure graphs for different educational datasets, significantly improving the applicability of learning path recommendation methods. We then propose a Discrimination Learning-driven Reinforcement Learning (DLRL) module that utilizes similarity relations as fallback relations when prerequisite relations become ineffective, thereby alleviating the blocked learning. Finally, we conduct extensive experiments on three benchmark datasets, demonstrating that our method not only achieves state-of-the-art performance but also generates more effective and longer learning paths.

AAAI Conference 2026 Conference Paper

Knowledge-Enhanced Explainable Hypergraph Convolution Network for Medication Recommendation

  • Zihan Zhang
  • Hongzhi Liu
  • Xiaoshuang Guo
  • Tianqi Sun
  • Zhonghai Wu

Medication recommendation systems aim to provide personalized and safe medication options based on individual patient records. However, existing approaches often face challenges related to inadequate modeling of complex relationships within Electronic Health Records (EHRs), data sparsity, and a lack of explainability for recommendations. In this paper, we present a Knowledge-enhanced Explainable HyperGraph Convolution Network (KEHGCN) that constructs a hierarchical hypergraph structure to capture the multi-level relationships within EHR data. By incorporating external knowledge graphs, our approach introduces additional positive relations that help alleviate the impact of data sparsity on model learning. Furthermore, by performing generalized metapath construction and selection on the knowledge graph, our approach achieves effective knowledge filtering and extracts semantically meaningful metapaths, thereby further enhancing the explainability of the recommendation results. We also explicitly introduce negative relations present in the domain knowledge to improve the safety of medication recommendation. Extensive experiments on different hospital departments of MIMIC-III and MIMIC-IV datasets demonstrate that KEHGCN outperforms other state-of-the-art baselines.

ICLR Conference 2025 Conference Paper

Almost Optimal Batch-Regret Tradeoff for Batch Linear Contextual Bandits

  • Zihan Zhang
  • Xiangyang Ji
  • Yuan Zhou 0007

We study the optimal batch-regret tradeoff for batch linear contextual bandits. For this problem, we design batch learning algorithms and prove that they achieve the optimal regret bounds (up to logarithmic factors) for any batch number $M$, number of actions $K$, time horizon $T$, and dimension $d$. Therefore, we establish the \emph{full-parameter-range} (almost) optimal batch-regret tradeoff for the batch linear contextual bandit problem. In the course of our analysis, we also prove a new matrix concentration inequality that depends on the matrices' dynamic upper bounds, which, to the best of our knowledge, is the first of its kind in the literature and may be of independent interest.

NeurIPS Conference 2025 Conference Paper

Deployment Efficient Reward-Free Exploration with Linear Function Approximation

  • Zihan Zhang
  • Yuxin Chen
  • Jason Lee
  • Simon Du
  • Lin Yang
  • Ruosong Wang

We study deployment-efficient reward-free exploration with linear function approximation, where the goal is to explore a linear Markov Decision Process (MDP) without revealing the reward function, while minimizing the number of distinct policies implemented during learning. By ``deployment efficient'', we mean algorithms that require few policies deployed during exploration -- crucial in real-world applications where such deployments are costly or disruptive. We design a novel reinforcement learning algorithm that achieves near-optimal deployment efficiency for linear MDPs in the reward-free setting, using at most $H$ exploration policies during execution (where $H$ is the horizon length), while maintaining sample complexity polynomial in feature dimension and horizon length. Unlike previous approaches with similar deployment efficiency guarantees, our algorithm's sample complexity is independent of the reachability or explorability coefficients of the underlying MDP, which can be arbitrarily small and lead to unbounded sample complexity in certain cases -- directly addressing an open problem from prior work. Our technical contributions include a data-dependent method for truncating state-action pairs in linear MDPs, efficient offline policy evaluation and optimization algorithms for these truncated MDPs, and a careful integration of these components to implement reward-free exploration with linear function approximation without sacrificing deployment efficiency.

IROS Conference 2025 Conference Paper

MapDiffusion: Generative Diffusion for Vectorized Online HD Map Construction and Uncertainty Estimation in Autonomous Driving

  • Thomas Monninger
  • Zihan Zhang
  • Zhipeng Mo
  • Md Zafar Anwar
  • Steffen Staab
  • Sihao Ding 0004

Autonomous driving requires an understanding of the static environment from sensor data. Learned Bird’s-Eye View (BEV) encoders are commonly used to fuse multiple inputs, and a vector decoder predicts a vectorized map representation from the latent BEV grid. However, traditional map construction models provide deterministic point estimates, failing to capture uncertainty and the inherent ambiguities of real-world environments, such as occlusions and missing lane markings. We propose MapDiffusion, a novel generative approach that leverages the diffusion paradigm to learn the full distribution of possible vectorized maps. Instead of predicting a single deterministic output from learned queries, MapDiffusion iteratively refines randomly initialized queries, conditioned on a BEV latent grid, to generate multiple plausible map samples. This allows aggregating samples to improve prediction accuracy and deriving uncertainty estimates that directly correlate with scene ambiguity. Extensive experiments on the nuScenes dataset demonstrate that MapDiffusion achieves state-of-the-art performance in online map construction, surpassing the baseline by 5% in single-sample performance. We further show that aggregating multiple samples consistently improves performance along the ROC curve, validating the benefit of distribution modeling. Additionally, our uncertainty estimates are significantly higher in occluded areas, reinforcing their value in identifying regions with ambiguous sensor input. By modeling the full map distribution, MapDiffusion enhances the robustness and reliability of online vectorized HD map construction, enabling uncertainty-aware decision-making for autonomous vehicles in complex environments.

ICML Conference 2025 Conference Paper

Minimax Optimal Regret Bound for Reinforcement Learning with Trajectory Feedback

  • Zihan Zhang
  • Yuxin Chen 0002
  • Jason D. Lee
  • Simon S. Du
  • Ruosong Wang

In this work, we study reinforcement learning (RL) with trajectory feedback. Compared to the standard RL setting, in RL with trajectory feedback, the agent only observes the accumulative reward along the trajectory, and therefore, this model is particularly suitable for scenarios where querying the reward in each single step incurs prohibitive cost. For a finite-horizon Markov Decision Process (MDP) with $S$ states, $A$ actions and a horizon length of $H$, we develop an algorithm that enjoys an asymptotically nearly optimal regret of $\tilde{O}\left(\sqrt{SAH^3K}\right)$ in $K$ episodes. To achieve this result, our new technical ingredients include (i) constructing a tighter confidence region for the reward function by incorporating the RL with trajectory feedback setting with techniques in linear bandits and (ii) constructing a reference transition model to better guide the exploration process.
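The trajectory-feedback setting can be made concrete with a toy sketch (not the paper's algorithm): since only the cumulative reward of each episode is observed, per-(state, action) rewards can be recovered by regressing episode returns on visit counts, which is the linear-bandit view the abstract alludes to. The function name and the small ridge regularizer are illustrative assumptions:

```python
import numpy as np

def estimate_rewards(visits, returns, reg=1e-8):
    """Ridge regression of episode returns on state-action visit counts.
    visits:  (K, d) matrix; row k holds the visit counts of episode k
    returns: (K,)   observed cumulative reward of each episode
    Returns the least-squares estimate of the per-visit reward vector."""
    d = visits.shape[1]
    A = visits.T @ visits + reg * np.eye(d)   # regularized Gram matrix
    return np.linalg.solve(A, visits.T @ returns)
```

The estimation error of such a regression is governed by the covariance of the visit-count vectors, which is why confidence-region techniques from linear bandits transfer to this setting.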

JBHI Journal 2025 Journal Article

Prediction of circRNA-Drug Associations Based on Bipartite Graph Transformer

  • Zihan Zhang
  • Yuchen Zhang
  • Xiujuan Lei

Circular RNAs (circRNAs) represent a distinctive class of non-coding RNAs with covalently closed loop structures that play crucial regulatory roles in drug response. While existing computational methods have achieved certain progress in prediction tasks, they primarily relied on circRNA genotypes and traditional molecular fingerprints, with limited utilization of multi-omics data and inadequate consideration of heterogeneous network topology. To address these limitations, this study proposed the CircRNA-Drug Bipartite Graph Transformer (CDBGT) framework to predict associations. Rather than limiting to associations between circRNA genotypes and drugs, this study integrated circRNA-drug response and target association information from multiple databases. CDBGT employed pre-trained models RNA-FM and ChemBERTa to extract sequence and molecular fingerprint features and utilized multi-omics data to construct similarity matrices. The framework incorporated a bipartite graph transformer with topological positional encoding, comprehensively considering degree encoding, degree ranking encoding and spectral encoding to extract topological information from heterogeneous networks. Experimental results showed that CDBGT performed stably in 5-fold cross-validation. On the Response dataset, it achieved a ROC-AUC of 0.9674 and a PR-AUC of 0.9540, while on the Target dataset it reached a ROC-AUC of 0.8621. Compared with existing methods, it showed an improvement of 3.20 to 26.87 percentage points in ROC-AUC. Ablation experiments demonstrated the necessity of each module. Through literature-supported case studies, this work suggested potential directions for circRNA-based therapeutic research.

IROS Conference 2025 Conference Paper

Robust Model-Free Path Tracking Algorithm for Hydraulic Center-Articulated Scooptrams

  • Zihan Zhang
  • Cunguang Fang
  • Pu Xu
  • Zheng Fang

This paper proposes a model-free steering control method to address the path tracking challenges of Hydraulic Center-articulated Scooptrams (HCS) in narrow underground mining environments. Due to the nonlinear and time-delay characteristics of the hydraulic steering system, the HCS exhibits response lag when executing control commands. The lag time demonstrates dynamic uncertainty influenced by operating conditions, hydraulic pressure, and load variations. To address this challenge, an adaptive steering control strategy is designed. This strategy leverages the geometric relationship between the HCS and the reference path to dynamically adjust the look-ahead distance, thereby compensating for the uncertainty caused by the hydraulic system lag. Additionally, the error is mapped to the actual control input in real-time through a feedback error controller, effectively correcting control errors caused by lag without relying on a complex hydraulic system model. The proposed method was experimentally validated in a full-scale simulated mining tunnel, demonstrating considerable robustness and precise path tracking performance under uneven terrain, heavy loads, significant initial error, and bidirectional movement. This method provides a viable solution for the autonomous navigation of the HCS.

NeurIPS Conference 2025 Conference Paper

Sharp Gap-Dependent Variance-Aware Regret Bounds for Tabular MDPs

  • Shulun Chen
  • Runlong Zhou
  • Zihan Zhang
  • Maryam Fazel
  • Simon Du

We consider gap-dependent regret bounds for episodic MDPs. We show that the Monotonic Value Propagation (MVP) algorithm (Zhang et al. [2024]) achieves a variance-aware gap-dependent regret bound of $$\tilde{O}\left(\left(\sum_{\Delta_h(s, a)>0} \frac{H^2 \log K \land \mathtt{Var}_{\max}^{\textup{c}}}{\Delta_h(s, a)} +\sum_{\Delta_h(s, a)=0}\frac{ H^2 \land \mathtt{Var}_{\max}^{\textup{c}}}{\Delta_{\mathrm{min}}} + SAH^4 (S \lor H) \right) \log K\right), $$ where $H$ is the planning horizon, $S$ is the number of states, $A$ is the number of actions, $K$ is the number of episodes, and $\tilde{O}$ hides $\mathsf{poly} \log (S, A, H, 1 / \Delta_{\mathrm{min}}, 1 / \delta)$ terms. Here, $\Delta_h(s, a) = V_h^* (s) - Q_h^* (s, a)$ represents the suboptimality gap and $\Delta_{\mathrm{min}} := \min_{\Delta_h (s, a) > 0} \Delta_h(s, a)$. The term $\mathtt{Var}_{\max}^{\textup{c}}$ denotes the maximum conditional total variance, calculated as the maximum over all $(\pi, h, s)$ tuples of the expected total variance under policy $\pi$ conditioned on trajectories visiting state $s$ at step $h$. $\mathtt{Var}_{\max}^{\textup{c}}$ characterizes the maximum randomness encountered when learning any $(h, s)$ pair. Our result stems from a novel analysis of the weighted sum of the suboptimality gaps and can potentially be adapted for other algorithms. To complement the study, we establish a lower bound of $$\Omega \left( \sum_{\Delta_h(s, a)>0} \frac{H^2 \land \mathtt{Var}_{\max}^{\textup{c}}}{\Delta_h(s, a)}\cdot \log K\right), $$ demonstrating the necessity of dependence on $\mathtt{Var}_{\max}^{\textup{c}}$ even when the maximum unconditional total variance (without conditioning on $(h, s)$) approaches zero.

NeurIPS Conference 2024 Conference Paper

Achieving Tractable Minimax Optimal Regret in Average Reward MDPs

  • Victor Boone
  • Zihan Zhang

In recent years, significant attention has been directed towards learning average-reward Markov Decision Processes (MDPs). However, existing algorithms either suffer from sub-optimal regret guarantees or computational inefficiencies. In this paper, we present the first *tractable* algorithm with minimax optimal regret of $\mathrm{O}\left(\sqrt{\mathrm{sp}(h^*) S A T \log(SAT)}\right)$ where $\mathrm{sp}(h^*)$ is the span of the optimal bias function $h^*$, $S\times A$ is the size of the state-action space and $T$ the number of learning steps. Remarkably, our algorithm does not require prior information on $\mathrm{sp}(h^*)$. Our algorithm relies on a novel subroutine, **P**rojected **M**itigated **E**xtended **V**alue **I**teration (`PMEVI`), to compute bias-constrained optimal policies efficiently. This subroutine can be applied to various previous algorithms to obtain improved regret bounds.

JMLR Journal 2024 Journal Article

Classification with Deep Neural Networks and Logistic Loss

  • Zihan Zhang
  • Lei Shi
  • Ding-Xuan Zhou

Deep neural networks (DNNs) trained with the logistic loss (also known as the cross entropy loss) have made impressive advancements in various binary classification tasks. Despite the considerable success in practice, generalization analysis for binary classification with deep neural networks and the logistic loss remains scarce. The unboundedness of the target function for the logistic loss in binary classification is the main obstacle to deriving satisfactory generalization bounds. In this paper, we aim to fill this gap by developing a novel theoretical analysis and using it to establish tight generalization bounds for training fully connected ReLU DNNs with logistic loss in binary classification. Our generalization analysis is based on an elegant oracle-type inequality which enables us to deal with the boundedness restriction of the target function. Using this oracle-type inequality, we establish generalization bounds for fully connected ReLU DNN classifiers $\hat{f}^{\text{FNN}}_n$ trained by empirical logistic risk minimization with respect to i.i.d. samples of size $n$, which lead to sharp rates of convergence as $n\to\infty$. In particular, we obtain optimal convergence rates for $\hat{f}^{\text{FNN}}_n$ (up to some logarithmic factor) only requiring the Hölder smoothness of the conditional class probability $\eta$ of data. Moreover, we consider a compositional assumption that requires $\eta$ to be the composition of several vector-valued multivariate functions of which each component function is either a maximum value function or a Hölder smooth function only depending on a small number of its input variables. Under this assumption, we can even derive optimal convergence rates for $\hat{f}^{\text{FNN}}_n$ (up to some logarithmic factor) which are independent of the input dimension of data. This result explains why in practice DNN classifiers can overcome the curse of dimensionality and perform well in high-dimensional classification problems. 
Furthermore, we establish dimension-free rates of convergence under other circumstances such as when the decision boundary is piecewise smooth and the input data are bounded away from it. Besides the novel oracle-type inequality, the sharp convergence rates presented in our paper also owe to a tight error bound for approximating the natural logarithm function near zero (where it is unbounded) by ReLU DNNs. In addition, we justify our claims for the optimality of rates by proving corresponding minimax lower bounds. All these results are new in the literature and will deepen our theoretical understanding of classification with deep neural networks.
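The "unboundedness of the target function" the abstract highlights can be made concrete. In standard notation (a routine pointwise calculation, not taken from the paper), the population minimizer of the logistic risk is the log-odds of the conditional class probability, which diverges wherever that probability approaches 0 or 1:

```latex
% Logistic loss \phi(t) = \ln(1 + e^{-t}); for labels Y \in \{-1, +1\}
% with \eta(x) = \mathbb{P}(Y = 1 \mid X = x), the population risk
% \mathbb{E}[\phi(Y f(X))] is minimized pointwise by the log-odds:
f^{*}(x) = \ln \frac{\eta(x)}{1 - \eta(x)},
\qquad f^{*}(x) \to \pm\infty \ \text{ as } \ \eta(x) \to 1 \text{ or } 0.
```

This is why generalization bounds that assume a bounded target function do not apply directly, motivating the oracle-type inequality the abstract describes.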

ICLR Conference 2024 Conference Paper

Horizon-Free Regret for Linear Markov Decision Processes

  • Zihan Zhang
  • Jason D. Lee
  • Yuxin Chen 0002
  • Simon S. Du

A recent line of works showed regret bounds in reinforcement learning (RL) can be (nearly) independent of planning horizon, a.k.a. the horizon-free bounds. However, these regret bounds only apply to settings where a polynomial dependency on the size of transition model is allowed, such as tabular Markov Decision Process (MDP) and linear mixture MDP. We give the first horizon-free bound for the popular linear MDP setting where the size of the transition model can be exponentially large or even uncountable. In contrast to prior works which explicitly estimate the transition model and compute the inhomogeneous value functions at different time steps, we directly estimate the value functions and confidence sets. We obtain the horizon-free bound by: (1) maintaining multiple weighted least square estimators for the value functions; and (2) a structural lemma which shows the maximal total variation of the inhomogeneous value functions is bounded by a polynomial factor of the feature dimension.

IROS Conference 2024 Conference Paper

OSM vs HD Maps: Map Representations for Trajectory Prediction

  • Jing-Yan Liao
  • Parth Doshi
  • Zihan Zhang
  • David Paz
  • Henrik I. Christensen

High Definition (HD) Maps have long been favored for their precise depictions of static road elements. However, their accessibility constraints and vulnerability to rapid environmental changes impede the widespread deployment of highly map-reliant autonomous driving tasks, such as motion forecasting. In this context, we propose to leverage OpenStreetMap (OSM) as a promising alternative to HD Maps for long-term motion forecasting. The contributions of this work are threefold: firstly, we extend the application of OSM to long-horizon forecasting, doubling the forecasting horizon compared to previous studies. Secondly, through an expanded observation landscape and the integration of intersection priors, our OSM-based approach exhibits competitive performance, narrowing the gap with HD-map-based models. Lastly, we conduct an exhaustive context-aware analysis, providing deeper insights in motion forecasting across diverse scenarios as well as conducting class-aware comparisons. This research not only advances long-term motion forecasting with coarse map representations but additionally offers a scalable solution within the domain of autonomous driving.

AAAI Conference 2024 Conference Paper

Text Diffusion with Reinforced Conditioning

  • Yuxuan Liu
  • Tianchi Yang
  • Shaohan Huang
  • Zihan Zhang
  • Haizhen Huang
  • Furu Wei
  • Weiwei Deng
  • Feng Sun

Diffusion models have demonstrated exceptional capability in generating high-quality images, videos, and audio. Due to their adaptiveness in iterative refinement, they provide a strong potential for achieving better non-autoregressive sequence generation. However, existing text diffusion models still fall short in their performance due to a challenge in handling the discreteness of language. This paper thoroughly analyzes text diffusion models and uncovers two significant limitations: degradation of self-conditioning during training and misalignment between training and sampling. Motivated by our findings, we propose a novel Text Diffusion model called TReC, which mitigates the degradation with Reinforced Conditioning and the misalignment by Time-Aware Variance Scaling. Our extensive experiments demonstrate the competitiveness of TReC against autoregressive, non-autoregressive, and diffusion baselines. Moreover, qualitative analysis shows its advanced ability to fully utilize the diffusion process in refining samples.

NeurIPS Conference 2023 Conference Paper

Open Visual Knowledge Extraction via Relation-Oriented Multimodality Model Prompting

  • Hejie Cui
  • Xinyu Fang
  • Zihan Zhang
  • Ran Xu
  • Xuan Kan
  • Xin Liu
  • Yue Yu
  • Manling Li

Images contain rich relational knowledge that can help machines understand the world. Existing methods on visual knowledge extraction often rely on a pre-defined format (e.g., sub-verb-obj tuples) or vocabulary (e.g., relation types), restricting the expressiveness of the extracted knowledge. In this work, we take a first step toward a new paradigm of open visual knowledge extraction. To achieve this, we present OpenVik, which consists of an open relational region detector that detects regions potentially containing relational knowledge, and a visual knowledge generator that generates format-free knowledge by prompting the large multimodality model with the detected region of interest. We also explore two data enhancement techniques for diversifying the generated format-free visual knowledge. Extensive knowledge quality evaluations highlight the correctness and uniqueness of the extracted open visual knowledge by OpenVik. Moreover, integrating our extracted knowledge across various visual reasoning applications shows consistent improvements, indicating the real-world applicability of OpenVik.

ICML Conference 2023 Conference Paper

Sharp Variance-Dependent Bounds in Reinforcement Learning: Best of Both Worlds in Stochastic and Deterministic Environments

  • Runlong Zhou
  • Zihan Zhang
  • Simon S. Du

We study variance-dependent regret bounds for Markov decision processes (MDPs). Algorithms with variance-dependent regret guarantees can automatically exploit environments with low variance (e.g., enjoying constant regret on deterministic MDPs). The existing algorithms are either variance-independent or suboptimal. We first propose two new environment norms to characterize the fine-grained variance properties of the environment. For model-based methods, we design a variant of the MVP algorithm (Zhang et al., 2021a). We apply new analysis techniques to demonstrate that this algorithm enjoys variance-dependent bounds with respect to the norms we propose. In particular, this bound is simultaneously minimax optimal for both stochastic and deterministic MDPs, the first result of its kind. We further initiate the study on model-free algorithms with variance-dependent regret bounds by designing a reference-function-based algorithm with a novel capped-doubling reference update schedule. Lastly, we also provide lower bounds to complement our upper bounds.

NeurIPS Conference 2022 Conference Paper

Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning

  • Zihan Zhang
  • Yuhang Jiang
  • Yuan Zhou
  • Xiangyang Ji

In this paper, we study the episodic reinforcement learning (RL) problem modeled by finite-horizon Markov Decision Processes (MDPs) with a constraint on the number of batches. In the multi-batch reinforcement learning framework, the agent must commit in advance to a time schedule for policy updates, which is particularly suitable for scenarios where the agent suffers extensively from changing the policy adaptively. Given a finite-horizon MDP with $S$ states, $A$ actions and planning horizon $H$, we design a computationally efficient algorithm to achieve near-optimal regret of $\tilde{O}(\sqrt{SAH^3K\ln(1/\delta)})$\footnote{$\tilde{O}(\cdot)$ hides logarithmic terms of $(S, A, H, K)$.} in $K$ episodes using $O\left(H+\log_2\log_2(K) \right)$ batches with confidence parameter $\delta$. To the best of our knowledge, it is the first $\tilde{O}(\sqrt{SAH^3K})$ regret bound with $O(H+\log_2\log_2(K))$ batch complexity. Meanwhile, we show that to achieve $\tilde{O}(\mathrm{poly}(S, A, H)\sqrt{K})$ regret, the number of batches must be at least $\Omega\left(H/\log_A(K)+ \log_2\log_2(K) \right)$, which matches our upper bound up to logarithmic terms. Our technical contributions are twofold: 1) a near-optimal design scheme to explore over the unlearned states; 2) a computationally efficient algorithm to explore certain directions with an approximated transition model.

NeurIPS Conference 2021 Conference Paper

Improved Variance-Aware Confidence Sets for Linear Bandits and Linear Mixture MDP

  • Zihan Zhang
  • Jiaqi Yang
  • Xiangyang Ji
  • Simon S. Du

This paper presents new \emph{variance-aware} confidence sets for linear bandits and linear mixture Markov Decision Processes (MDPs). With the new confidence sets, we obtain the following regret bounds: For linear bandits, we obtain an $\widetilde{O}(\mathrm{poly}(d)\sqrt{1 + \sum_{k=1}^{K}\sigma_k^2})$ data-dependent regret bound, where $d$ is the feature dimension, $K$ is the number of rounds, and $\sigma_k^2$ is the \emph{unknown} variance of the reward at the $k$-th round. This is the first regret bound that only scales with the variance and the dimension but \emph{no explicit polynomial dependency on $K$}. When variances are small, this bound can be significantly smaller than the $\widetilde{\Theta}\left(d\sqrt{K}\right)$ worst-case regret bound. For linear mixture MDPs, we obtain an $\widetilde{O}(\mathrm{poly}(d, \log H)\sqrt{K})$ regret bound, where $d$ is the number of base models, $K$ is the number of episodes, and $H$ is the planning horizon. This is the first regret bound that only scales \emph{logarithmically} with $H$ in the reinforcement learning with linear function approximation setting, thus \emph{exponentially improving} existing results, and resolving an open problem in \citep{zhou2020nearly}. We develop three technical ideas that may be of independent interest: 1) applications of the peeling technique to both the input norm and the variance magnitude, 2) a recursion-based estimator for the variance, and 3) a new convex potential lemma that generalizes the seminal elliptical potential lemma.

AAAI Conference 2021 Conference Paper

Learning from My Friends: Few-Shot Personalized Conversation Systems via Social Networks

  • Zhiliang Tian
  • Wei Bi
  • Zihan Zhang
  • Dongkyu Lee
  • Yiping Song
  • Nevin L. Zhang

Personalized conversation models (PCMs) generate responses according to speaker preferences. Existing personalized conversation tasks typically require models to extract speaker preferences from user descriptions or their conversation histories, which are scarce for newcomers and inactive users. In this paper, we propose a few-shot personalized conversation task with an auxiliary social network. The task requires models to generate personalized responses for a speaker given a few conversations from the speaker and a social network. Existing methods are mainly designed to incorporate descriptions or conversation histories. Those methods can hardly model speakers with so few conversations or connections between speakers. To better cater to newcomers with few resources, we propose a personalized conversation model (PCM) that learns to adapt to new speakers and enables new speakers to learn from resource-rich speakers. Particularly, based on a meta-learning based PCM, we propose a task aggregator (TA) to collect other speakers' information from the social network. The TA provides prior knowledge of the new speaker in its meta-learning. Experimental results show our methods outperform all baselines in appropriateness, diversity, and consistency with speakers.

ICML Conference 2021 Conference Paper

Model-Free Reinforcement Learning: from Clipped Pseudo-Regret to Sample Complexity

  • Zihan Zhang
  • Yuan Zhou 0007
  • Xiangyang Ji

In this paper we consider the problem of learning an $\epsilon$-optimal policy for a discounted Markov Decision Process (MDP). Given an MDP with $S$ states, $A$ actions, the discount factor $\gamma \in (0, 1)$, and an approximation threshold $\epsilon > 0$, we provide a model-free algorithm to learn an $\epsilon$-optimal policy with sample complexity $\tilde{O}(\frac{SA\ln(1/p)}{\epsilon^2(1-\gamma)^{5.5}})$\footnote{In this work, the notation $\tilde{O}(\cdot)$ hides poly-logarithmic factors of $S, A, 1/(1-\gamma)$, and $1/\epsilon$.} and success probability $(1-p)$. For small enough $\epsilon$, we show an improved algorithm with sample complexity $\tilde{O}(\frac{SA\ln(1/p)}{\epsilon^2(1-\gamma)^{3}})$. While the first bound improves upon all known model-free algorithms and model-based ones with tight dependence on $S$, our second algorithm beats all known sample complexity bounds and matches the information theoretic lower bound up to logarithmic factors.

ICML Conference 2021 Conference Paper

Near Optimal Reward-Free Reinforcement Learning

  • Zihan Zhang
  • Simon S. Du
  • Xiangyang Ji

We study the reward-free reinforcement learning framework, which is particularly suitable for batch reinforcement learning and scenarios where one needs policies for multiple reward functions. This framework has two phases: in the exploration phase, the agent collects trajectories by interacting with the environment without using any reward signal; in the planning phase, the agent needs to return a near-optimal policy for arbitrary reward functions. We give a new efficient algorithm, \textbf{S}taged \textbf{S}ampling + \textbf{T}runcated \textbf{P}lanning (SSTP), which interacts with the environment at most $O\left( \frac{S^2A}{\epsilon^2}\mathrm{poly}\log\left(\frac{SAH}{\epsilon}\right) \right)$ episodes in the exploration phase, and guarantees to output a near-optimal policy for arbitrary reward functions in the planning phase, where $S$ is the size of state space, $A$ is the size of action space, $H$ is the planning horizon, and $\epsilon$ is the target accuracy relative to the total reward. Notably, our sample complexity scales only \emph{logarithmically} with $H$, in contrast to all existing results which scale \emph{polynomially} with $H$. Furthermore, this bound matches the minimax lower bound $\Omega\left(\frac{S^2A}{\epsilon^2}\right)$ up to logarithmic factors. Our results rely on three new techniques: 1) a new sufficient condition for the dataset to plan for an $\epsilon$-suboptimal policy; 2) a new way to plan efficiently under the proposed condition using soft-truncated planning; 3) constructing an extended MDP to maximize the truncated accumulative rewards efficiently.

NeurIPS Conference 2020 Conference Paper

Almost Optimal Model-Free Reinforcement Learning via Reference-Advantage Decomposition

  • Zihan Zhang
  • Yuan Zhou
  • Xiangyang Ji

We study the reinforcement learning problem in the setting of finite-horizon episodic Markov Decision Processes (MDPs) with $S$ states, $A$ actions, and episode length $H$. We propose a model-free algorithm UCB-ADVANTAGE and prove that it achieves $\tilde{O}(\sqrt{H^2SAT})$ regret where $T = KH$ and $K$ is the number of episodes to play. Our regret bound improves upon the results of [Jin et al., 2018] and matches the best known model-based algorithms as well as the information theoretic lower bound up to logarithmic factors. We also show that UCB-ADVANTAGE achieves low local switching cost and applies to concurrent reinforcement learning, improving upon the recent results of [Bai et al., 2019].

IJCAI Conference 2020 Conference Paper

Argot: Generating Adversarial Readable Chinese Texts

  • Zihan Zhang
  • Mingxuan Liu
  • Chao Zhang
  • Yiming Zhang
  • Zhou Li
  • Qi Li
  • Haixin Duan
  • Donghong Sun

Natural language processing (NLP) models are known to be vulnerable to adversarial examples, similar to image processing models. Studying adversarial texts is an essential step to improve the robustness of NLP models. However, existing studies mainly focus on analyzing English texts and generating adversarial examples for English texts. There is no work studying the possibility and effect of the transformation to another language, e.g., Chinese. In this paper, we analyze the differences between Chinese and English, and explore the methodology to transform the existing English adversarial generation method to Chinese. We propose a novel black-box adversarial Chinese text generation solution, Argot, by utilizing the method for adversarial English samples together with several novel methods developed on Chinese characteristics. Argot can effectively and efficiently generate adversarial Chinese texts with good readability. Furthermore, Argot can also automatically generate targeted Chinese adversarial texts, achieving a high success rate while ensuring the readability of the generated Chinese texts.

NeurIPS Conference 2019 Conference Paper

AttentionXML: Label Tree-based Attention-Aware Deep Model for High-Performance Extreme Multi-Label Text Classification

  • Ronghui You
  • Zihan Zhang
  • Ziye Wang
  • Suyang Dai
  • Hiroshi Mamitsuka
  • Shanfeng Zhu

Extreme multi-label text classification (XMTC) is an important problem in the era of {\it big data}, for tagging a given text with the most relevant multiple labels from an extremely large-scale label set. XMTC can be found in many applications, such as item categorization, web page tagging, and news annotation. Traditionally most methods used bag-of-words (BOW) as inputs, ignoring word context as well as deep semantic information. Recent attempts to overcome the problems of BOW by deep learning still suffer from 1) failing to capture the important subtext for each label and 2) lack of scalability against the huge number of labels. We propose a new label tree-based deep learning model for XMTC, called AttentionXML, with two unique features: 1) a multi-label attention mechanism with raw text as input, which allows it to capture the most relevant part of the text for each label; and 2) a shallow and wide probabilistic label tree (PLT), which allows it to handle millions of labels, especially "tail labels". We empirically compared the performance of AttentionXML with those of eight state-of-the-art methods over six benchmark datasets, including Amazon-3M with around 3 million labels. AttentionXML outperformed all competing methods under all experimental settings. Experimental results also show that AttentionXML achieved the best performance on tail labels among label tree-based methods. The code and datasets are available at \url{http://github.com/yourh/AttentionXML}.

NeurIPS Conference 2019 Conference Paper

Regret Minimization for Reinforcement Learning by Evaluating the Optimal Bias Function

  • Zihan Zhang
  • Xiangyang Ji

We present an algorithm based on the \emph{Optimism in the Face of Uncertainty} (OFU) principle which is able to efficiently learn in Reinforcement Learning (RL) problems modeled by Markov decision processes (MDPs) with finite state-action space. By evaluating the state-pair difference of the optimal bias function $h^{*}$, the proposed algorithm achieves a regret bound of $\tilde{O}(\sqrt{SATH})$\footnote{The symbol $\tilde{O}$ means $O$ with log factors ignored.} for an MDP with $S$ states and $A$ actions, in the case that an upper bound $H$ on the span of $h^{*}$, i.e., $\mathrm{sp}(h^{*})$, is known. This result outperforms the best previous regret bound $\tilde{O}(HS\sqrt{AT})$ \cite{bartlett2009regal} by a factor of $\sqrt{SH}$. Furthermore, this regret bound matches the lower bound of $\Omega(\sqrt{SATH})$ \cite{jaksch2010near} up to a logarithmic factor. As a consequence, we show that there is a near-optimal regret bound of $\tilde{O}(\sqrt{DSAT})$ for MDPs with finite diameter $D$, compared to the lower bound of $\Omega(\sqrt{DSAT})$ \cite{jaksch2010near}.

AAAI Conference 2016 Conference Paper

Multi-Domain Active Learning for Recommendation

  • Zihan Zhang
  • Xiaoming Jin
  • Lianghao Li
  • Guiguang Ding
  • Qiang Yang

Recently, active learning has been applied to recommendation to deal with data sparsity in a single domain. In this paper, we propose an active learning strategy for recommendation to alleviate data sparsity in a multi-domain scenario. Specifically, our proposed active learning strategy simultaneously considers both domain-specific and domain-independent knowledge over all domains. We use the expected entropy to measure the generalization error of the domain-specific knowledge and propose a variance-based strategy to measure the generalization error of the domain-independent knowledge. The proposed active learning strategy uses a unified function to effectively combine these two measurements. We compare our strategy with five state-of-the-art baselines on five different multi-domain recommendation tasks, which are constituted by three real-world data sets. The experimental results show that our strategy performs significantly better than all the baselines and reduces human labeling efforts by at least 5.6%, 8.3%, 11.8%, 12.5% and 15.4% on the five tasks, respectively.