
Author name cluster

Long Yang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

16 papers
2 author rows

Possible papers (16)

JBHI Journal 2025 Journal Article

BINDTI: A Bi-Directional Intention Network for Drug-Target Interaction Identification Based on Attention Mechanisms

  • Lihong Peng
  • Xin Liu
  • Long Yang
  • Longlong Liu
  • Zongzheng Bai
  • Min Chen
  • Xu Lu
  • Libo Nie

The identification of drug-target interactions (DTIs) is an essential step in drug discovery. In vitro experimental methods are expensive, laborious, and time-consuming. Deep learning has witnessed promising progress in DTI prediction. However, precisely representing drug and protein features remains a major challenge for DTI prediction. Here, we developed an end-to-end DTI identification framework called BINDTI based on a bi-directional Intention network. First, drug features are encoded with graph convolutional networks based on the drug's 2D molecular graph, obtained from its SMILES string. Next, protein features are encoded from the amino acid sequence through a mixed model called ACmix, which integrates the self-attention mechanism and convolution. Third, drug and target features are fused through the bi-directional Intention network, which combines Intention and multi-head attention. Finally, unknown drug-target (DT) pairs are classified through a multilayer perceptron based on the fused DT features. The results demonstrate that BINDTI greatly outperformed four baseline methods (i.e., CPI-GNN, TransformerCPI, MolTrans, and IIFDTI) on the BindingDB, BioSNAP, DrugBank, and Human datasets. More importantly, it was better suited than the four baselines to predicting new DTIs on imbalanced datasets. Ablation results elucidated that both the bi-directional Intention network and ACmix greatly advance DTI prediction. Visualization of the fused features and case studies showed that BINDTI's predictions were largely consistent with the true labels. We anticipate that the proposed BINDTI framework can find new low-cost drug candidates, improve drugs' virtual screening, and further facilitate drug repositioning as well as drug discovery.
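
To make the four-stage pipeline concrete, here is a minimal, heavily simplified sketch: a toy graph-convolution layer for the drug, an embedding-plus-convolution encoder for the protein, bi-directional cross-attention for fusion, and an MLP head. The dimensions, the single-layer GCN, and the use of plain multi-head cross-attention in place of ACmix and the Intention module are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical BINDTI-style skeleton: graph encoder + sequence encoder +
# bi-directional cross-attention fusion + MLP classifier. Sizes and the
# simplified layers are assumptions for illustration only.
import torch
import torch.nn as nn

class SimpleGCNLayer(nn.Module):
    """One dense-adjacency graph convolution: H' = ReLU(A @ H @ W)."""
    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.lin = nn.Linear(dim_in, dim_out)

    def forward(self, adj, feats):            # adj: (N, N), feats: (N, dim_in)
        return torch.relu(adj @ self.lin(feats))

class DTIClassifier(nn.Module):
    def __init__(self, atom_dim=32, aa_vocab=26, hid=64):
        super().__init__()
        self.drug_gcn = SimpleGCNLayer(atom_dim, hid)
        self.prot_emb = nn.Embedding(aa_vocab, hid)
        self.prot_conv = nn.Conv1d(hid, hid, kernel_size=7, padding=3)
        # Bi-directional cross-attention: drug attends to protein, and back.
        self.d2p = nn.MultiheadAttention(hid, num_heads=4, batch_first=True)
        self.p2d = nn.MultiheadAttention(hid, num_heads=4, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(2 * hid, hid), nn.ReLU(),
                                 nn.Linear(hid, 1))

    def forward(self, adj, atom_feats, aa_ids):
        d = self.drug_gcn(adj, atom_feats).unsqueeze(0)        # (1, N, hid)
        p = self.prot_conv(
            self.prot_emb(aa_ids).unsqueeze(0).transpose(1, 2)
        ).transpose(1, 2)                                      # (1, L, hid)
        d_ctx, _ = self.d2p(d, p, p)       # drug queries protein
        p_ctx, _ = self.p2d(p, d, d)       # protein queries drug
        fused = torch.cat([d_ctx.mean(1), p_ctx.mean(1)], dim=-1)
        return torch.sigmoid(self.mlp(fused))                  # P(interaction)

# Toy usage: a 5-atom drug graph and a 50-residue protein.
model = DTIClassifier()
print(model(torch.eye(5), torch.randn(5, 32), torch.randint(0, 26, (50,))))
```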

JBHI Journal 2025 Journal Article

DTI-MvSCA: An Anti-Over-Smoothing Multi-View Framework With Negative Sample Selection for Predicting Drug-Target Interactions

  • Lihong Peng
  • Zongzheng Bai
  • Longlong Liu
  • Long Yang
  • Xin Liu
  • Min Chen
  • Xing Chen

Predicting potential drug-target interactions (DTIs) helps accelerate drug discovery and reduce development costs. Current deep learning-based methods achieve high-performance predictions, but three challenges remain. First, the absence of reliable negative DTIs severely limits model performance. Second, existing graph neural networks struggle with scalability due to model complexity and graph size. Third, most methods focus on learning topological features while ignoring node features during DTI representation learning. To address these limitations, we develop a multi-view neural network framework called DTI-MvSCA for DTI identification. The framework begins by constructing a drug-protein pair (DPP) network with matrix operation-based negative DTI selection, then learns DPP representations through a Multi-view neural network, and finally classifies each DPP with a multilayer perceptron. In particular, the multi-view neural network integrates graph topological feature learning based on the Self-attention mechanism and the SHADOW graph attention network, node feature learning based on a 1D Convolutional neural network, and an Attention mechanism. In-depth experiments on DrugBank V3.0 and V5.0 showed that DTI-MvSCA obtained precise and robust predictions compared with five state-of-the-art baseline methods. Furthermore, visualizing the feature distributions of the selected negative DTIs exhibits a clearer, more distinguishable boundary. In summary, DTI-MvSCA provides a useful deep learning tool for investigating potential DTIs.
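
Since the abstract does not spell out the negative-selection step, the sketch below shows one generic way matrix operations can surface reliable negatives: propagate known interactions through similarity matrices and take the lowest-scoring unlabeled pairs. Every name in it is a hypothetical stand-in, not necessarily DTI-MvSCA's actual procedure.

```python
# Heavily hedged illustration of matrix operation-based negative selection.
import numpy as np

def select_reliable_negatives(Y, S_drug, S_prot, k):
    """Y: (n_d, n_p) 0/1 interaction matrix; S_*: similarity matrices."""
    score = S_drug @ Y @ S_prot            # similarity-propagated evidence
    score[Y == 1] = np.inf                 # never pick known positives
    idx = np.argsort(score, axis=None)[:k] # k lowest-evidence unlabeled pairs
    return np.unravel_index(idx, Y.shape)  # their (drug, protein) indices

Y = np.random.binomial(1, 0.05, (20, 30)).astype(float)
print(select_reliable_negatives(Y, np.eye(20), np.eye(30), 5))
```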

IROS Conference 2025 Conference Paper

SimLauncher: Launching Sample-Efficient Real-World Robotic Reinforcement Learning via Simulation Pre-Training

  • Mingdong Wu
  • Lehong Wu
  • Yizhuo Wu
  • Weiyao Huang
  • Hongwei Fan
  • Zheyuan Hu
  • Haoran Geng
  • Jinzhou Li

Autonomous learning of dexterous, long-horizon robotic skills has been a longstanding pursuit of embodied AI. Recent advances in robotic reinforcement learning (RL) have demonstrated remarkable performance and robustness in real-world visuomotor control tasks. However, applying RL in the real world faces challenges such as low sample efficiency, slow exploration, and significant reliance on human intervention. In contrast, simulators offer a safe and efficient environment for extensive exploration and data collection, while the visual sim-to-real gap, often a limiting factor, can be mitigated using real-to-sim techniques. Building on these observations, we propose SimLauncher, a novel framework that combines the strengths of real-world RL and real-to-sim-to-real approaches to overcome these challenges. Specifically, we first pre-train a visuomotor policy in the digital-twin simulation environment, which then benefits real-world RL in two ways: (1) bootstrapping target values using extensive simulated demonstrations and real-world demonstrations derived from pre-trained policy rollouts, and (2) incorporating action proposals from the pre-trained policy for better exploration. We conduct comprehensive experiments across multi-stage, contact-rich, and dexterous hand manipulation tasks. Compared to prior real-world RL approaches, SimLauncher significantly improves sample efficiency and achieves near-perfect success rates. We hope this work serves as a proof of concept and inspires further research on leveraging large-scale simulation pre-training to benefit real-world robotic RL.
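
The sketch below illustrates, under loose assumptions, the two mechanisms named in the abstract: demonstrations mixed into the batch used for bootstrapped targets, and action proposals arbitrated by the critic. Function names, signatures, and the demo mixing ratio are all invented for illustration, not the authors' code.

```python
# Hedged sketch of (1) demo mixing for bootstrapped targets and
# (2) pre-trained-policy action proposals.
import random

def sample_batch(replay, demos, batch_size, demo_ratio=0.5):
    """Mix online RL transitions with demonstration transitions."""
    n_demo = min(int(batch_size * demo_ratio), len(demos))
    return (random.sample(demos, n_demo)
            + random.sample(replay, batch_size - n_demo))

def act(obs, rl_policy, pretrained_policy, q_value, explore_eps=0.1):
    """Propose actions from both policies and let the critic arbitrate."""
    candidates = [rl_policy(obs), pretrained_policy(obs)]
    if random.random() < explore_eps:          # occasional random proposal
        return random.choice(candidates)
    return max(candidates, key=lambda a: q_value(obs, a))

# Toy usage with scalar observations/actions and a quadratic critic.
print(act(0.3, lambda s: s, lambda s: 0.5, lambda s, a: -(a - s) ** 2))
```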

IJCAI Conference 2024 Conference Paper

FlagVNE: A Flexible and Generalizable Reinforcement Learning Framework for Network Resource Allocation

  • Tianfu Wang
  • Qilin Fan
  • Chao Wang
  • Long Yang
  • Leilei Ding
  • Nicholas Jing Yuan
  • Hui Xiong

Virtual network embedding (VNE) is an essential resource allocation task in network virtualization, aiming to map virtual network requests (VNRs) onto physical infrastructure. Reinforcement learning (RL) has recently emerged as a promising solution to this problem. However, existing RL-based VNE methods are limited by the unidirectional action design and one-size-fits-all training strategy, resulting in restricted searchability and generalizability. In this paper, we propose a flexible and generalizable RL framework for VNE, named FlagVNE. Specifically, we design a bidirectional action-based Markov decision process model that enables the joint selection of virtual and physical nodes, thus improving the exploration flexibility of solution space. To tackle the expansive and dynamic action space, we design a hierarchical decoder to generate adaptive action probability distributions and ensure high training efficiency. Furthermore, to overcome the generalization issue for varying VNR sizes, we propose a meta-RL-based training method with a curriculum scheduling strategy, facilitating specialized policy training for each VNR size. Finally, extensive experimental results show the effectiveness of FlagVNE across multiple key metrics. Our code is available at https://github.com/GeminiLight/flag-vne.
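
The bidirectional action design can be pictured as a hierarchical factorization of a joint action, p(v, p) = p(v) · p(p | v), which keeps the decoder tractable as the action space grows or shrinks with VNR size. The minimal sketch below shows only the sampling step, with invented tensor shapes; see the linked repository for the actual model.

```python
# Hedged sketch of sampling a joint (virtual node, physical node) action.
import torch

def sample_joint_action(virtual_logits, physical_logits):
    """virtual_logits: (V,); physical_logits: (V, P), one row per v."""
    v = torch.distributions.Categorical(logits=virtual_logits).sample()
    p = torch.distributions.Categorical(logits=physical_logits[v]).sample()
    return v.item(), p.item()

print(sample_joint_action(torch.randn(4), torch.randn(4, 10)))
```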

NeurIPS Conference 2024 Conference Paper

Optimizing over Multiple Distributions under Generalized Quasar-Convexity Condition

  • Shihong Ding
  • Long Yang
  • Luo Luo
  • Cong Fang

We study a typical optimization model where the optimization variable is composed of multiple probability distributions. Though the model appears frequently in practice, such as in policy problems, it lacks specific analysis in the general setting. For this optimization problem, we propose a new structural condition/landscape description named generalized quasar-convexity (GQC), beyond the realm of convexity. In contrast to the original quasar-convexity \citep{hinder2020near}, GQC allows an individual quasar-convex parameter $\gamma_i$ for each variable block $i$, where a smaller $\gamma_i$ implies weaker block-convexity. To minimize the objective function, we consider a generalized oracle termed the internal function, which includes the standard gradient oracle as a special case. We provide optimistic mirror descent (OMD) for multiple distributions and prove that the algorithm achieves an adaptive $\tilde{\mathcal{O}}((\sum_{i=1}^d 1/\gamma_i)\epsilon^{-1})$ iteration complexity to find an $\epsilon$-suboptimal global solution, without knowing the exact values of $\gamma_i$ in advance, when the objective admits a ``polynomial-like'' structure. Notably, this iteration complexity does not explicitly depend on the number of distributions and is strictly faster ($\sum_{i=1}^d 1/\gamma_i$ vs. $d\max_{i\in[1:d]} 1/\gamma_i$) than that of mirror descent methods. We also extend GQC to the minimax optimization problem, proposing the generalized quasar-convexity-concavity (GQCC) condition and a decentralized variant of OMD with regularization. Finally, we show applications of our algorithmic framework to discounted Markov decision processes and Markov games, which brings new insights into the landscape analysis of reinforcement learning.
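
For readers unfamiliar with quasar-convexity, the following hedged sketch reconstructs the flavor of the condition from the abstract; the block-wise inequality is our guess at GQC's shape, and the paper's exact definition (stated via the internal-function oracle) may be more general.

```latex
% Quasar-convexity (Hinder et al., 2020): f is gamma-quasar-convex w.r.t. a
% minimizer x^* if, for all x,
\[
  f(x^\ast) \;\ge\; f(x)
    + \tfrac{1}{\gamma}\,\langle \nabla f(x),\; x^\ast - x \rangle .
\]
% GQC, per the abstract, grants each distribution block i its own gamma_i:
\[
  f(x^\ast) \;\ge\; f(x)
    + \sum_{i=1}^{d} \tfrac{1}{\gamma_i}\,
      \langle \nabla_{x_i} f(x),\; x_i^\ast - x_i \rangle ,
\]
% so the OMD analysis can pay for each block separately, consistent with the
% adaptive (sum_i 1/gamma_i) * eps^{-1} rate quoted above.
```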

AAAI Conference 2023 Conference Paper

Augmented Proximal Policy Optimization for Safe Reinforcement Learning

  • Juntao Dai
  • Jiaming Ji
  • Long Yang
  • Qian Zheng
  • Gang Pan

Safe reinforcement learning considers practical scenarios that maximize the return while satisfying safety constraints. Current algorithms, which suffer from training oscillations or approximation errors, still struggle to update the policy efficiently with precise constraint satisfaction. In this article, we propose Augmented Proximal Policy Optimization (APPO), which augments the Lagrangian function of the primal constrained problem by attaching a quadratic deviation term. The constructed multiplier-penalty function dampens cost oscillation for stable convergence while remaining equivalent to the primal constrained problem, allowing safety costs to be controlled precisely. APPO alternately updates the policy and the Lagrangian multiplier by solving the constructed augmented primal-dual problem, which can be easily implemented with any first-order optimizer. We apply APPO to diverse safety-constrained tasks, setting a new state of the art compared with a comprehensive list of safe RL baselines. Extensive experiments verify the merits of our method in easy implementation, stable convergence, and precise cost control.
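
The "Lagrangian plus quadratic deviation" construction the abstract refers to matches the classical augmented-Lagrangian template; a hedged reconstruction, which may differ from APPO's exact form, is:

```latex
% For the constrained problem  max_theta J_R(theta)  s.t.  J_C(theta) <= d,
% one textbook multiplier-penalty (augmented Lagrangian) form is
\[
  \mathcal{L}_{\rho}(\theta, \lambda)
  \;=\; J_R(\theta)
  \;-\; \lambda \bigl( J_C(\theta) - d \bigr)
  \;-\; \tfrac{\rho}{2} \bigl( J_C(\theta) - d \bigr)^{2},
\]
% ascended in theta and descended in lambda in alternation. The quadratic
% term damps the cost oscillations plain Lagrangian methods exhibit, without
% changing the constrained optimum.
```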

NeurIPS Conference 2023 Conference Paper

VOCE: Variational Optimization with Conservative Estimation for Offline Safe Reinforcement Learning

  • Jiayi Guan
  • Guang Chen
  • Jiaming Ji
  • Long Yang
  • Ao Zhou
  • Zhijun Li
  • Changjun Jiang

Offline safe reinforcement learning (RL) algorithms promise to learn policies that satisfy safety constraints directly from offline datasets, without interacting with the environment. This setting is particularly important in scenarios with high sampling costs and potential dangers, such as autonomous driving and robotics. However, the influence of safety constraints and out-of-distribution (OOD) actions has made it challenging for previous methods to achieve high reward returns while ensuring safety. In this work, we propose a Variational Optimization with Conservative Estimation algorithm (VOCE) to solve the problem of optimizing safe policies on offline datasets. Concretely, we reframe offline safe RL as probabilistic inference, introducing variational distributions to make policy optimization more flexible. Subsequently, we employ pessimistic estimation methods to estimate the Q-values of cost and reward, which mitigates the extrapolation errors induced by OOD actions. Finally, extensive experiments demonstrate that VOCE achieves competitive performance across multiple experimental tasks, particularly outperforming state-of-the-art algorithms in terms of safety.
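
As one hedged illustration of "pessimistic estimation", a critic-ensemble version is sketched below: take a lower value for reward (so OOD actions are not over-valued) and an upper value for cost (so they are not under-costed). VOCE's actual estimators come from its variational formulation and may differ substantially; all names here are illustrative.

```python
# Hedged stand-in for pessimistic reward / conservative cost estimation.
def pessimistic_reward_q(q_ensemble, s, a):
    return min(q(s, a) for q in q_ensemble)   # lower bound on reward value

def conservative_cost_q(qc_ensemble, s, a):
    return max(qc(s, a) for qc in qc_ensemble)  # upper bound on cost value

# Toy usage: two-member ensembles disagreeing on an OOD action.
print(pessimistic_reward_q([lambda s, a: 1.0, lambda s, a: 0.6], None, None))
print(conservative_cost_q([lambda s, a: 0.1, lambda s, a: 0.4], None, None))
```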

NeurIPS Conference 2022 Conference Paper

Constrained Update Projection Approach to Safe Policy Optimization

  • Long Yang
  • Jiaming Ji
  • Juntao Dai
  • Linrui Zhang
  • Binbin Zhou
  • Pengfei Li
  • Yaodong Yang
  • Gang Pan

Safe reinforcement learning (RL) studies problems where an intelligent agent must not only maximize reward but also avoid exploring unsafe areas. In this study, we propose CUP, a novel policy optimization method based on a Constrained Update Projection framework that enjoys a rigorous safety guarantee. Central to our development of CUP are newly proposed surrogate functions along with a performance bound. Compared to previous safe RL methods, CUP offers several benefits: 1) it generalizes the surrogate functions to the generalized advantage estimator (GAE), leading to strong empirical performance; 2) it unifies performance bounds, providing better understanding and interpretability of some existing algorithms; 3) it admits a non-convex implementation using only first-order optimizers, which requires no strong approximation of the convexity of the objectives. To validate CUP, we compared it against a comprehensive list of safe RL baselines on a wide range of tasks. Experiments show the effectiveness of CUP in terms of both reward and safety constraint satisfaction. The source code of CUP is available at https://github.com/zmsn-2077/CUP-safe-rl.
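
The name describes the method's shape well: one improvement step, one projection step. A hedged sketch of that two-stage template follows; the specific surrogates and performance bounds are the paper's contribution and are omitted here.

```latex
% Unconstrained improvement on a GAE-based surrogate, then projection back
% into the safe set under a divergence D (details per the paper):
\[
  \pi_{k+\frac{1}{2}}
    = \operatorname*{arg\,max}_{\pi}\;
      \mathbb{E}_{s,a \sim \pi_k}\!\bigl[ A^{\mathrm{GAE}}_{\pi_k}(s,a) \bigr],
  \qquad
  \pi_{k+1}
    = \operatorname*{arg\,min}_{\pi \,:\, J_C(\pi) \le d}\;
      D\bigl( \pi,\; \pi_{k+\frac{1}{2}} \bigr).
\]
```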

IJCAI Conference 2022 Conference Paper

Penalized Proximal Policy Optimization for Safe Reinforcement Learning

  • Linrui Zhang
  • Li Shen
  • Long Yang
  • Shixiang Chen
  • Xueqian Wang
  • Bo Yuan
  • Dacheng Tao

Safe reinforcement learning aims to learn the optimal policy while satisfying safety constraints, which is essential in real-world applications. However, current algorithms still struggle to perform efficient policy updates with hard constraint satisfaction. In this paper, we propose Penalized Proximal Policy Optimization (P3O), which solves the cumbersome constrained policy iteration via a single minimization of an equivalent unconstrained problem. Specifically, P3O utilizes a simple yet effective penalty approach to eliminate cost constraints and removes the trust-region constraint via the clipped surrogate objective. We theoretically prove the exactness of the penalized method with a finite penalty factor and provide a worst-case analysis of the approximation error when evaluated on sample trajectories. Moreover, we extend P3O to the more challenging multi-constraint and multi-agent scenarios, which are less studied in previous work. Extensive experiments show that P3O outperforms state-of-the-art algorithms with respect to both reward improvement and constraint satisfaction on a set of constrained locomotion tasks.
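
The core reduction can be written in a few lines. The sketch below shows an exact-penalty objective of the kind the abstract describes, with the PPO-style clipped surrogates elided; the function name and kappa value are illustrative, not the paper's code.

```python
# Hedged sketch: constrained problem  max J_R(pi) s.t. J_C(pi) <= d  becomes
# one unconstrained minimization with a ReLU penalty, exact for a finite,
# large-enough kappa.
import torch

def p3o_style_loss(reward_surr, cost_surr, cost_old, limit, kappa=20.0):
    """reward_surr, cost_surr: differentiable scalar tensors."""
    violation = torch.relu(cost_surr + cost_old - limit)  # predicted overshoot
    return -reward_surr + kappa * violation

# Toy usage: the predicted cost is 0.9 under the limit, so no penalty fires.
print(p3o_style_loss(torch.tensor(1.2), torch.tensor(0.1), 24.0, 25.0))
```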

AAAI Conference 2022 Conference Paper

Policy Optimization with Stochastic Mirror Descent

  • Long Yang
  • Yu Zhang
  • Gang Zheng
  • Qian Zheng
  • Pengfei Li
  • Jianhang Huang
  • Gang Pan

Improving sample efficiency has been a longstanding goal in reinforcement learning. This paper proposes the VRMPO algorithm: a sample-efficient policy gradient method with stochastic mirror descent. In VRMPO, a novel variance-reduced policy gradient estimator is presented to improve sample efficiency. We prove that the proposed VRMPO needs only $O(\epsilon^{-3})$ sample trajectories to achieve an $\epsilon$-approximate first-order stationary point, which matches the best-known sample complexity for policy optimization. Extensive empirical results demonstrate that VRMPO outperforms state-of-the-art policy gradient methods in various settings.
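
For readers unfamiliar with mirror descent, a hedged sketch of a single policy step under a Bregman divergence follows; the variance-reduced estimator $g_t$ is the paper's contribution and is treated as given here.

```latex
% One stochastic-mirror-descent step against a gradient estimate g_t, with
% psi strongly convex; psi(x) = ||x||^2 / 2 recovers plain SGD.
\[
  \theta_{t+1}
    = \operatorname*{arg\,min}_{\theta}\;
      \langle -g_t,\; \theta \rangle
      + \tfrac{1}{\alpha}\, B_{\psi}(\theta, \theta_t),
  \qquad
  B_{\psi}(x, y) = \psi(x) - \psi(y)
      - \langle \nabla \psi(y),\; x - y \rangle .
\]
```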

AAAI Conference 2021 Conference Paper

On Convergence of Gradient Expected Sarsa(λ)

  • Long Yang
  • Gang Zheng
  • Yu Zhang
  • Qian Zheng
  • Pengfei Li
  • Gang Pan

We study the convergence of Expected Sarsa(λ) with function approximation. We show that applying an off-line estimate (multi-step bootstrapping) to Expected Sarsa(λ) is unstable for off-policy learning. Based on the convex-concave saddle-point framework, we then propose a convergent Gradient Expected Sarsa(λ) (GES(λ)) algorithm. The theoretical analysis shows that the proposed GES(λ) converges to the optimal solution at a linear convergence rate in the true-gradient setting. Furthermore, we develop a Lyapunov-function technique to investigate how the stepsize influences the finite-time performance of GES(λ); this technique can potentially be generalized to other gradient temporal-difference algorithms. Finally, our experiments verify the effectiveness of GES(λ). For the detailed proofs, please refer to https://arxiv.org/pdf/2012.07199.pdf.
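
The saddle-point framework mentioned here is standard in gradient-TD analysis; a hedged sketch of the reformulation follows, with GES(λ)'s specific matrices left to the paper.

```latex
% With linear function approximation, minimizing the MSPBE
% (1/2) ||A theta - b||^2_{C^{-1}}  is equivalent, by Fenchel duality, to
% the convex-concave bilinear saddle-point problem
\[
  \min_{\theta} \max_{\omega}\;
    \langle b - A\theta,\; \omega \rangle
    - \tfrac{1}{2}\, \omega^{\top} C\, \omega ,
\]
% which admits convergent two-timescale stochastic updates in (theta, omega).
```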

AAAI Conference 2021 Conference Paper

Sample Complexity of Policy Gradient Finding Second-Order Stationary Points

  • Long Yang
  • Qian Zheng
  • Gang Pan

Policy-based reinforcement learning (RL) can be considered as maximization of its objective. However, due to the inherent non-concavity of this objective, convergence of the policy gradient method to a first-order stationary point (FOSP) cannot guarantee a maximum: a FOSP can be a minimum or even a saddle point, which is undesirable for RL. It has been found that if all saddle points are strict, all second-order stationary points (SOSPs) are exactly equivalent to local maxima. Instead of FOSPs, we therefore consider SOSPs as the convergence criterion to characterize the sample complexity of policy gradient. Our result shows that policy gradient converges to an $(\epsilon, \sqrt{\epsilon\chi})$-SOSP with probability at least $1-\tilde{O}(\delta)$ after a total cost of $O\left(\frac{\epsilon^{-9/2}}{(1-\gamma)\sqrt{\chi}}\log\frac{1}{\delta}\right)=\tilde{O}(\epsilon^{-9/2})$, where $\gamma\in(0,1)$. This significantly improves on the state-of-the-art cost of $\tilde{O}(\epsilon^{-9})$. Our analysis is based on the key idea of decomposing the parameter space $\mathbb{R}^p$ into three non-intersecting regions: the non-stationary-point region, the saddle-point region, and the local-optimal region, and then making a local improvement of the RL objective in each region. This technique can potentially be generalized to extensive policy gradient methods. For the complete proof, please refer to https://arxiv.org/pdf/2012.01491.pdf.

AAMAS Conference 2019 Conference Paper

TBQ(σ): Improving Efficiency of Trace Utilization for Off-Policy Reinforcement Learning

  • Longxiang Shi
  • Shijian Li
  • Longbing Cao
  • Long Yang
  • Gang Pan

Off-policy reinforcement learning with eligibility traces is challenging because of the discrepancy between the target policy and the behavior policy. One common approach is to measure the difference between the two policies in a probabilistic way, as in importance sampling and tree-backup. However, existing off-policy learning methods based on probabilistic policy measurement are inefficient at utilizing traces under a greedy target policy, which limits their effectiveness for control problems: the traces are cut immediately when a non-greedy action is taken, which can forfeit the advantage of eligibility traces and slow down learning. Alternatively, non-probabilistic measurement methods such as General Q(λ) and Naive Q(λ) never cut traces, but face convergence problems in practice. To address these issues, this paper introduces a new method named TBQ(σ), which effectively unifies the tree-backup algorithm and Naive Q(λ). By introducing a new parameter σ to express the degree of trace utilization, TBQ(σ) creates an effective integration of TB(λ) and Naive Q(λ) and a continuous role shift between them. The contraction property of TBQ(σ) is theoretically analyzed for both policy evaluation and control settings. We also derive the online version of TBQ(σ) and give a convergence proof. We show empirically that, for ϵ ∈ (0, 1] in ϵ-greedy policies, there exists some degree of trace utilization for λ ∈ [0, 1] that improves the efficiency of off-policy reinforcement learning, both accelerating the learning process and improving performance.
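
The abstract suggests a σ-interpolated trace decay between TB(λ) and Naive Q(λ); the following is a hedged guess at its shape, not the paper's exact update.

```python
# Hedged guess at the interpolated trace decay: sigma = 0 recovers
# TB(lambda)'s decay gamma*lambda*pi(a|s), which cuts the trace whenever a
# non-greedy action (pi(a|s) = 0) is taken; sigma = 1 recovers Naive
# Q(lambda)'s decay gamma*lambda, which never cuts.
def trace_decay(gamma, lam, pi_a_given_s, sigma):
    return gamma * lam * (sigma + (1.0 - sigma) * pi_a_given_s)

assert trace_decay(0.99, 0.9, 0.0, 0.0) == 0.0               # TB: trace cut
assert abs(trace_decay(0.99, 0.9, 0.0, 1.0) - 0.891) < 1e-9  # Naive Q(lambda)
```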

IJCAI Conference 2018 Conference Paper

A Unified Approach for Multi-step Temporal-Difference Learning with Eligibility Traces in Reinforcement Learning

  • Long Yang
  • Minhao Shi
  • Qian Zheng
  • Wenjia Meng
  • Gang Pan

Recently, a new multi-step temporal-difference learning algorithm, Q(σ), was shown to unify n-step Tree-Backup (when σ = 0) and n-step Sarsa (when σ = 1) by introducing a sampling parameter σ. However, like other multi-step temporal-difference learning algorithms, Q(σ) requires substantial memory and computation time. The eligibility trace is an important mechanism for transforming off-line updates into efficient on-line ones that consume less memory and computation time. In this paper, we combine the original Q(σ) with eligibility traces and propose a new algorithm, called Qπ(σ, λ), where λ is the trace-decay parameter. This new algorithm unifies Sarsa(λ) (when σ = 1) and Qπ(λ) (when σ = 0). Furthermore, we give an upper error bound for the Qπ(σ, λ) policy evaluation algorithm, and we prove that the Qπ(σ, λ) control algorithm converges to the optimal value function exponentially. We also compare it empirically with conventional temporal-difference learning methods. Results show that, with an intermediate value of σ, Qπ(σ, λ) creates a mixture of the existing algorithms that learns the optimal value significantly faster than either extreme (σ = 0 or 1).
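
For context, the one-step Q(σ) backup target being unified here is standard and easy to state; a small sketch follows, with the eligibility-trace machinery of Qπ(σ, λ) omitted.

```python
# One-step Q(sigma) backup target: sigma = 1 gives the Sarsa (full-sampling)
# target, sigma = 0 the Tree-Backup (full-expectation) target.
def q_sigma_target(r, gamma, q_next, a_next, pi_next, sigma):
    """q_next: {action: value} at s'; pi_next: {action: pi(a|s')}."""
    sampled = q_next[a_next]
    expected = sum(pi_next[a] * q_next[a] for a in q_next)
    return r + gamma * (sigma * sampled + (1.0 - sigma) * expected)

q, pi = {0: 1.0, 1: 2.0}, {0: 0.5, 1: 0.5}
print(q_sigma_target(1.0, 0.9, q, 1, pi, 0.5))  # 1 + 0.9*(0.5*2 + 0.5*1.5)
```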