Arrow Research · Search

Author name cluster

Long Zhao

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

14 papers
1 author row

Possible papers

14

JBHI Journal 2026 Journal Article

MDDTA: A Drug Target Binding Affinity Prediction Method Based on Molecular Dynamics Simulation Data Enhancement

  • Long Zhao
  • Hongmei Wang
  • Ximin Zeng
  • Shaoping Shi

Deep learning-based methods for drug-target binding affinity (DTA) prediction are improving the efficiency of drug screening, but some limitations persist in current methodologies. Notably, prevailing models predominantly rely on static structural data while neglecting the conformational dynamics of drug-target complexes, which compromises their capacity to discern subtle conformation-dependent affinity variations. To address this issue, we first constructed MD-PDBbind, an enhanced-sampling molecular dynamics (MD) simulation dataset. Building upon this foundation, the MDDTA model, incorporating the novel FAFormer architecture, was proposed to achieve SE(3) equivariance and invariance, allowing the model to better learn the geometric information of drug-target complexes. Furthermore, we formulated a dynamics-aware loss function to enhance the adaptability of the model to diverse conformations. MDDTA demonstrates excellent scoring and ranking performance on the CASF-2016 dataset, with a case study providing intuitive validation of the effectiveness of incorporating dynamic information. Lastly, a drug screening process was developed using MDDTA to screen 70 SARS-CoV-2 candidate compounds, five of which have been validated in the literature. These results highlight the potential of MDDTA for practical drug screening.
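The paper's exact loss is not reproduced here, but a minimal sketch of what a conformation-aware regression loss could look like (the softmax down-weighting of outlier snapshots is an illustrative assumption, not MDDTA's published formulation):

```python
import torch

def conformation_aware_loss(preds: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # preds: (num_conformations,) affinity predictions over MD snapshots
    # of one drug-target complex; target: scalar experimental affinity.
    per_conf = (preds - target) ** 2                    # error per snapshot
    weights = torch.softmax(-per_conf.detach(), dim=0)  # favor consistent snapshots
    return (weights * per_conf).sum()                   # weighted regression loss
```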

NeurIPS Conference 2025 Conference Paper

InstructSAM: A Training-free Framework for Instruction-Oriented Remote Sensing Object Recognition

  • Yijie Zheng
  • Weijie Wu
  • Qingyun Li
  • Xuehui Wang
  • Xu Zhou
  • Aiai Ren
  • Jun Shen
  • Long Zhao

Language-guided object recognition in remote sensing imagery is crucial for large-scale mapping and automated data annotation. However, existing open-vocabulary and visual grounding methods rely on explicit category cues, limiting their ability to handle complex or implicit queries that require advanced reasoning. To address this issue, we introduce a new suite of tasks, including Instruction-Oriented Object Counting, Detection, and Segmentation (InstructCDS), covering open-vocabulary, open-ended, and open-subclass scenarios. We further present EarthInstruct, the first InstructCDS benchmark for earth observation. It is constructed from two diverse remote sensing datasets with varying spatial resolutions and annotation rules across 20 categories, necessitating models to interpret dataset-specific instructions. Given the scarcity of semantically rich labeled data in remote sensing, we propose InstructSAM, a training-free framework for instruction-driven object recognition. InstructSAM leverages large vision-language models to interpret user instructions and estimate object counts, employs SAM2 for mask proposal, and formulates mask-label assignment as a binary integer programming problem. By integrating semantic similarity with counting constraints, InstructSAM efficiently assigns categories to predicted masks without relying on confidence thresholds. Experiments demonstrate that InstructSAM matches or surpasses specialized baselines across multiple tasks while maintaining near-constant inference time regardless of object count, reducing output tokens by 89% and overall runtime by over 32% compared to direct generation approaches. We believe the contributions of the proposed tasks, benchmark, and effective approach will advance future research in developing versatile object recognition systems. The code is available at https://VoyagerXvoyagerx.github.io/InstructSAM.
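The mask-label assignment step the abstract describes can be made concrete. Below is a self-contained sketch of one such binary integer program using SciPy; the variable layout and the equality-count constraint are assumptions for illustration, and the paper's exact formulation may differ:

```python
import numpy as np
from scipy.optimize import Bounds, LinearConstraint, milp

def assign_labels(sim: np.ndarray, counts: np.ndarray) -> np.ndarray:
    """Assign one category (or none) to each mask by maximizing total
    mask-category similarity while matching estimated per-category counts.

    sim: (M, C) similarity between M masks and C categories.
    counts: (C,) object counts estimated by the vision-language model.
    """
    M, C = sim.shape
    A_mask = np.kron(np.eye(M), np.ones((1, C)))  # each mask: at most one label
    A_cat = np.kron(np.ones((1, M)), np.eye(C))   # each category: exactly counts[j]
    constraints = [
        LinearConstraint(A_mask, ub=np.ones(M)),
        LinearConstraint(A_cat, lb=counts, ub=counts),
    ]
    res = milp(c=-sim.ravel(), constraints=constraints,   # milp minimizes, so negate
               integrality=np.ones(M * C), bounds=Bounds(0, 1))
    assert res.success, "no assignment satisfies the count constraints"
    return res.x.reshape(M, C).round().astype(int)        # one-hot assignment matrix
```

Note that no confidence threshold appears anywhere: the counting constraints alone decide how many masks each category absorbs.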

AAAI Conference 2024 Conference Paper

MINES: Message Intercommunication for Inductive Relation Reasoning over Neighbor-Enhanced Subgraphs

  • Ke Liang
  • Lingyuan Meng
  • Sihang Zhou
  • Wenxuan Tu
  • Siwei Wang
  • Yue Liu
  • Meng Liu
  • Long Zhao

GraIL and its variants have shown promising capacity for inductive relation reasoning on knowledge graphs. However, the uni-directional message-passing mechanism hinders such models from exploiting hidden mutual relations between entities in directed graphs. Besides, the enclosing subgraph extraction in most GraIL-based models restricts the model from extracting enough discriminative information for reasoning. Consequently, the expressive ability of these models is limited. To address these problems, we propose a novel GraIL-based framework, termed MINES, by introducing a Message Intercommunication mechanism on the Neighbor-Enhanced Subgraph. Concretely, the message intercommunication mechanism is designed to capture the omitted hidden mutual information. It introduces bi-directed information interactions between connected entities by inserting an undirected/bi-directed GCN layer between uni-directed RGCN layers. Moreover, inspired by the success of involving more neighbors in other graph-based tasks, we extend the neighborhood area beyond the enclosing subgraph to enhance information collection for inductive relation reasoning. Extensive experiments prove the promising capacity of the proposed MINES from various aspects, especially its superiority, effectiveness, and transferability.
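A minimal sketch of the layer interleaving the abstract describes, using PyTorch Geometric; the module name and shared dimension are hypothetical, but the pattern of an undirected GCN layer sandwiched between directed RGCN layers follows the text:

```python
import torch
from torch_geometric.nn import GCNConv, RGCNConv
from torch_geometric.utils import to_undirected

class IntercommBlock(torch.nn.Module):
    """Directed relational passing, then an undirected exchange that lets
    connected entities share information in both directions, then another
    directed relational layer."""
    def __init__(self, dim: int, num_relations: int):
        super().__init__()
        self.rgcn_in = RGCNConv(dim, dim, num_relations)
        self.gcn_mid = GCNConv(dim, dim)
        self.rgcn_out = RGCNConv(dim, dim, num_relations)

    def forward(self, x, edge_index, edge_type):
        x = self.rgcn_in(x, edge_index, edge_type).relu()
        x = self.gcn_mid(x, to_undirected(edge_index)).relu()  # bi-directed exchange
        return self.rgcn_out(x, edge_index, edge_type)
```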

AAAI Conference 2024 Conference Paper

Sample-Level Cross-View Similarity Learning for Incomplete Multi-View Clustering

  • Suyuan Liu
  • Junpu Zhang
  • Yi Wen
  • Xihong Yang
  • Siwei Wang
  • Yi Zhang
  • En Zhu
  • Chang Tang

Incomplete multi-view clustering has attracted much attention due to its ability to handle partial multi-view data. Recently, similarity-based methods have been developed to explore the complete relationship among incomplete multi-view data. Although widely applied to partial scenarios, most of the existing approaches are still faced with two limitations. Firstly, fusing similarities constructed individually on each view fails to yield a complete unified similarity. Moreover, incomplete similarity generation may lead to anomalous similarity values with column sum constraints, affecting the final clustering results. To solve the above challenging issues, we propose a Sample-level Cross-view Similarity Learning (SCSL) method for Incomplete Multi-view Clustering. Specifically, we project all samples to the same dimension and simultaneously construct a complete similarity matrix across views based on the inter-view sample relationship and the intra-view sample relationship. In addition, a simultaneously learned consensus representation ensures the validity of the projection, which further enhances the quality of the similarity matrix through graph Laplacian regularization. Experimental results on six benchmark datasets demonstrate the ability of SCSL to handle incomplete multi-view clustering tasks. Our code is publicly available at https://github.com/Tracesource/SCSL.
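The graph Laplacian regularization the abstract invokes is standard; a minimal sketch, assuming S is the learned cross-view similarity matrix and H the consensus representation (names are illustrative, not taken from the SCSL code):

```python
import numpy as np

def laplacian_regularizer(S: np.ndarray, H: np.ndarray) -> float:
    """Smoothness term tr(H^T L H): samples that the learned similarity S
    deems close are pushed to have close consensus representations."""
    L = np.diag(S.sum(axis=1)) - S  # unnormalized Laplacian of the similarity graph
    return float(np.trace(H.T @ L @ H))
```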

TMLR Journal 2024 Journal Article

VideoGLUE: Video General Understanding Evaluation of Foundation Models

  • Liangzhe Yuan
  • Nitesh Bharadwaj Gundavarapu
  • Long Zhao
  • Hao Zhou
  • Yin Cui
  • Lu Jiang
  • Xuan Yang
  • Menglin Jia

We evaluate the video understanding capabilities of existing foundation models (FMs) using a carefully designed experiment protocol consisting of three hallmark tasks (action recognition, temporal localization, and spatiotemporal localization), eight datasets well received by the community, and four adaptation methods tailoring an FM for downstream tasks. Furthermore, we jointly profile FMs’ efficacy and efficiency when adapting to general video understanding tasks using cost measurements during both training and inference. Our main findings are as follows. First, task-specialized models significantly outperform the seven FMs studied in this work, in sharp contrast to what FMs have achieved in natural language and image understanding. Second, video-native FMs, whose pretraining data mainly contains the video modality, are generally better than image-native FMs in classifying motion-rich videos, localizing actions in time, and understanding a video of more than one action. Third, video-native FMs can perform well on video tasks under light adaptations to downstream tasks (e.g., freezing the FM backbones), while image-native FMs win in full end-to-end finetuning. The first two observations reveal the need and tremendous opportunities for research on video-focused FMs, and the last confirms that both tasks and adaptation methods matter when it comes to the evaluation of FMs. Our code is released at: https://github.com/tensorflow/models/tree/master/official/projects/videoglue
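As a concrete example of the "light adaptation" setting, freezing an FM backbone in PyTorch takes only a few lines; the head_prefix naming convention is an assumption about the model's parameter names, not part of the VideoGLUE protocol:

```python
import torch

def freeze_backbone(model: torch.nn.Module, head_prefix: str = "head") -> None:
    """Light adaptation: freeze every backbone parameter and leave only the
    task head trainable, so finetuning touches a small fraction of weights."""
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(head_prefix)
```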

AAAI Conference 2022 Conference Paper

Nested Hierarchical Transformer: Towards Accurate, Data-Efficient and Interpretable Visual Understanding

  • Zizhao Zhang
  • Han Zhang
  • Long Zhao
  • Ting Chen
  • Sercan Ö. Arik
  • Tomas Pfister

Hierarchical structures are popular in recent vision transformers; however, they require sophisticated designs and massive datasets to work well. In this paper, we explore the idea of nesting basic local transformers on non-overlapping image blocks and aggregating them in a hierarchical way. We find that the block aggregation function plays a critical role in enabling cross-block non-local information communication. This observation leads us to design a simplified architecture that requires minor code changes upon the original vision transformer. The benefits of the proposed judiciously selected design are threefold: (1) NesT converges faster and requires much less training data to achieve good generalization on both ImageNet and small datasets like CIFAR; (2) when extending our key ideas to image generation, NesT leads to a strong decoder that is 8 times faster than previous transformer-based generators; and (3) we show that decoupling the feature learning and abstraction processes via this nested hierarchy in our design enables constructing a novel method (named GradCAT) for visually interpreting the learned model. Source code is available at https://github.com/google-research/nested-transformer.
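The nesting idea reduces largely to reshapes. A sketch of partitioning a feature map into non-overlapping blocks so a local transformer can run on each block independently (block aggregation, which the paper studies separately, is omitted here):

```python
import torch

def block_images(x: torch.Tensor, block: int) -> torch.Tensor:
    """Partition (B, H, W, C) features into non-overlapping block×block
    windows, returning (B, num_blocks, tokens_per_block, C) so a local
    transformer can attend within each block independently."""
    B, H, W, C = x.shape
    x = x.reshape(B, H // block, block, W // block, block, C)
    x = x.permute(0, 1, 3, 2, 4, 5)            # (B, nH, nW, block, block, C)
    return x.reshape(B, -1, block * block, C)  # one token group per block
```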

NeurIPS Conference 2021 Conference Paper

Improved Transformer for High-Resolution GANs

  • Long Zhao
  • Zizhao Zhang
  • Ting Chen
  • Dimitris Metaxas
  • Han Zhang

Attention-based models, exemplified by the Transformer, can effectively model long-range dependencies, but suffer from the quadratic complexity of the self-attention operation, making them difficult to adopt for high-resolution image generation based on Generative Adversarial Networks (GANs). In this paper, we introduce two key ingredients to the Transformer to address this challenge. First, in low-resolution stages of the generative process, standard global self-attention is replaced with the proposed multi-axis blocked self-attention, which allows efficient mixing of local and global attention. Second, in high-resolution stages, we drop self-attention while only keeping multi-layer perceptrons reminiscent of the implicit neural function. To further improve the performance, we introduce an additional self-modulation component based on cross-attention. The resulting model, denoted as HiT, has a nearly linear computational complexity with respect to the image size and thus directly scales to synthesizing high-definition images. We show in the experiments that the proposed HiT achieves state-of-the-art FID scores of 30.83 and 2.95 on unconditional ImageNet 128×128 and FFHQ 256×256, respectively, with a reasonable throughput. We believe the proposed HiT is an important milestone for generators in GANs which are completely free of convolutions. Our code is made publicly available at https://github.com/google-research/hit-gan.
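The multi-axis idea can be illustrated with pure reshapes: the same feature map is regrouped two ways, so attention within each group is either local (pixels in one b×b block) or global (one pixel per block at the same offset). This is a simplification of HiT's actual blocks, not its implementation:

```python
import torch

def local_and_global_tokens(x: torch.Tensor, b: int):
    """Regroup (B, H, W, C) features along two axes. Attention over the
    last-but-one dimension is then local (within a b×b block) for the
    first output and global (dilated, across blocks) for the second."""
    B, H, W, C = x.shape
    g = x.reshape(B, H // b, b, W // b, b, C)
    local_tok = g.permute(0, 1, 3, 2, 4, 5).reshape(B, -1, b * b, C)
    global_tok = g.permute(0, 2, 4, 1, 3, 5).reshape(B, -1, (H // b) * (W // b), C)
    return local_tok, global_tok
```

Because each group has a fixed token count, attention cost grows roughly linearly with image size rather than quadratically.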

AAAI Conference 2021 Conference Paper

SMIL: Multimodal Learning with Severely Missing Modality

  • Mengmeng Ma
  • Jian Ren
  • Long Zhao
  • Sergey Tulyakov
  • Cathy Wu
  • Xi Peng

A common assumption in multimodal learning is the completeness of training data, i.e., full modalities are available in all training examples. Although there have been research efforts to develop novel methods that tackle the incompleteness of testing data, e.g., modalities partially missing in testing examples, few of them can handle incomplete training modalities. The problem becomes even more challenging in the severely missing case, e.g., when ninety percent of training examples have incomplete modalities. For the first time in the literature, this paper formally studies multimodal learning with missing modality in terms of flexibility (missing modalities in training, testing, or both) and efficiency (most training data have incomplete modality). Technically, we propose a new method named SMIL that leverages Bayesian meta-learning to achieve both objectives in a unified way. To validate our idea, we conduct a series of experiments on three popular benchmarks: MM-IMDb, CMU-MOSI, and avMNIST. The results prove the state-of-the-art performance of SMIL over existing methods and generative baselines, including autoencoders and generative adversarial networks.

NeurIPS Conference 2020 Conference Paper

Maximum-Entropy Adversarial Data Augmentation for Improved Generalization and Robustness

  • Long Zhao
  • Ting Liu
  • Xi Peng
  • Dimitris Metaxas

Adversarial data augmentation has shown promise for training robust deep neural networks against unforeseen data shifts or corruptions. However, it is difficult to define heuristics to generate effective fictitious target distributions containing "hard" adversarial perturbations that are largely different from the source distribution. In this paper, we propose a novel and effective regularization term for adversarial data augmentation. We theoretically derive it from the information bottleneck principle, which results in a maximum-entropy formulation. Intuitively, this regularization term encourages perturbing the underlying source distribution to enlarge the predictive uncertainty of the current model, so that the generated "hard" adversarial perturbations can improve the model robustness during training. Experimental results on three standard benchmarks demonstrate that our method consistently outperforms the existing state of the art by a statistically significant margin.
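A minimal sketch of one inner maximization step under such a formulation, assuming a classification model; the transport-cost penalty that keeps perturbations near the source distribution is omitted for brevity, so this is an illustration of the entropy regularizer rather than the paper's full procedure:

```python
import torch
import torch.nn.functional as F

def maxent_adversarial_step(model, x, y, alpha=1.0, lr=0.1):
    """One gradient-ascent step on the input: increase both the task loss
    and the predictive entropy of the current model, producing a "hard"
    fictitious sample for the next training round."""
    x_adv = x.clone().requires_grad_(True)
    logits = model(x_adv)
    probs = F.softmax(logits, dim=1)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1).mean()
    loss = F.cross_entropy(logits, y) + alpha * entropy  # maximum-entropy term
    grad, = torch.autograd.grad(loss, x_adv)
    return (x_adv + lr * grad).detach()                  # ascend, then detach
```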

NeurIPS Conference 2019 Conference Paper

Rethinking Kernel Methods for Node Representation Learning on Graphs

  • Yu Tian
  • Long Zhao
  • Xi Peng
  • Dimitris Metaxas

Graph kernels are kernel methods measuring graph similarity and serve as a standard tool for graph classification. However, the use of kernel methods for node classification, a problem related to graph representation learning, is still ill-posed, and the state-of-the-art methods are heavily based on heuristics. Here, we present a novel theoretical kernel-based framework for node classification that can bridge the gap between these two representation learning problems on graphs. Our approach is motivated by graph kernel methodology but extended to learn node representations capturing the structural information in a graph. We theoretically show that our formulation is as powerful as any positive semidefinite kernel. To efficiently learn the kernel, we propose a novel mechanism for node feature aggregation and a data-driven similarity metric employed during the training phase. More importantly, our framework is flexible and complementary to other graph-based deep learning models, e.g., Graph Convolutional Networks (GCNs). We empirically evaluate our approach on a number of standard node classification benchmarks and demonstrate that our model sets the new state of the art.
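The expressiveness claim rests on the fact that any Gram matrix of aggregated node features is a positive semidefinite kernel. A toy sketch, with mean-neighbor aggregation standing in for the paper's learned aggregation mechanism:

```python
import numpy as np

def node_kernel(features: np.ndarray, adj: np.ndarray) -> np.ndarray:
    """Aggregate each node's neighborhood features, then take inner
    products. Any Gram matrix K = phi @ phi.T built this way is positive
    semidefinite, hence a valid node-node kernel."""
    deg = adj.sum(axis=1, keepdims=True).clip(min=1)
    phi = (adj @ features) / deg  # mean over neighbors (illustrative choice)
    return phi @ phi.T
```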

IJCAI Conference 2018 Conference Paper

CR-GAN: Learning Complete Representations for Multi-view Generation

  • Yu Tian
  • Xi Peng
  • Long Zhao
  • Shaoting Zhang
  • Dimitris N. Metaxas

Generating multi-view images from a single-view input is an important yet challenging problem. It has broad applications in vision, graphics, and robotics. Our study indicates that the widely-used generative adversarial network (GAN) may learn "incomplete" representations due to the single-pathway framework: an encoder-decoder network followed by a discriminator network. We propose CR-GAN to address this problem. In addition to the single reconstruction path, we introduce a generation sideway to maintain the completeness of the learned embedding space. The two learning paths collaborate and compete in a parameter-sharing manner, yielding largely improved generality to "unseen" datasets. More importantly, the two-pathway framework makes it possible to combine both labeled and unlabeled data for self-supervised learning, which further enriches the embedding space for realistic generations. We evaluate our approach on a wide range of datasets. The results prove that CR-GAN significantly outperforms state-of-the-art methods, especially when generating from "unseen" inputs in wild conditions.

IJCAI Conference 2016 Conference Paper

Bridging Saliency Detection to Weakly Supervised Object Detection Based on Self-Paced Curriculum Learning

  • Dingwen Zhang
  • Deyu Meng
  • Long Zhao
  • Junwei Han

Weakly-supervised object detection (WOD) is a challenging problem in computer vision. The key problem is to simultaneously infer the exact object locations in the training images and train the object detectors, given only training images with weak image-level labels. Intuitively, by simulating the selective attention mechanism of the human visual system, saliency detection techniques can select attractive objects in scenes and thus are a potential way to provide useful priors for WOD. However, adopting saliency detection in WOD is not trivial, since the detected saliency region may be highly ambiguous in complex cases. To this end, this paper first comprehensively analyzes the challenges in applying saliency detection to WOD. Then, we make one of the earliest efforts to bridge saliency detection to WOD via self-paced curriculum learning, which can guide the learning procedure to gradually achieve faithful knowledge of multi-class objects from easy to hard. The experimental results demonstrate that the proposed approach can successfully bridge saliency detection and WOD tasks and achieve state-of-the-art object detection results under weak supervision.
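Self-paced learning admits a compact illustration: a hard-weighting scheme that keeps only samples the current model finds easy, with the age parameter grown over rounds to realize the easy-to-hard curriculum. This is a generic sketch, not the paper's full alternating optimization:

```python
import numpy as np

def self_paced_weights(losses: np.ndarray, lam: float) -> np.ndarray:
    """Hard self-paced weighting: admit samples whose current loss is
    below the age parameter lam; the rest are deferred to later rounds."""
    return (losses < lam).astype(float)
```

In practice, lam is increased by a fixed factor after each training round, so progressively harder (e.g., more ambiguous saliency) examples enter the curriculum.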