Arrow Research search

Author name cluster

Lei Gu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

4 papers
1 author row

Possible papers (4)

NeurIPS Conference 2025 Conference Paper

JanusDNA: A Powerful Bi-directional Hybrid DNA Foundation Model

  • Qihao Duan
  • Bingding Huang
  • Zhenqiao Song
  • Irina Lehmann
  • Lei Gu
  • Roland Eils
  • Benjamin Wild

Large language models (LLMs) have revolutionized natural language processing and are increasingly applied to other sequential data types, including genetic sequences. However, adapting LLMs to genetics presents significant challenges. Capturing complex genomic interactions requires modeling long-range global dependencies within DNA sequences, where interactions often span over 10,000 base pairs, even within a single gene. This poses substantial computational demands under conventional model architectures and training paradigms. Additionally, traditional LLM training approaches are suboptimal for DNA sequences: autoregressive training, while computationally efficient, supports only unidirectional sequence understanding, whereas DNA is inherently bidirectional. For instance, bidirectional promoters regulate gene expression in both directions and govern approximately 11% of human gene expression. Masked language models (MLMs) enable bidirectional understanding, but they are inefficient since only the masked tokens contribute to the loss at each training step. To address these limitations, we introduce JanusDNA, the first bidirectional DNA foundation model built upon a novel pretraining paradigm that integrates the optimization efficiency of autoregressive modeling with the bidirectional comprehension capability of masked modeling. JanusDNA's architecture leverages a Mamba-Attention Mixture-of-Experts (MoE) design, combining the global, high-resolution context awareness of attention mechanisms with the efficient sequential representation learning of Mamba. The MoE layers further enhance the model's capacity through sparse parameter scaling while keeping computational costs manageable. Notably, JanusDNA can process up to 1 million base pairs at single-nucleotide resolution on a single 80GB GPU using its hybrid architecture. Extensive experiments and ablation studies demonstrate that JanusDNA achieves new state-of-the-art performance on three genomic representation benchmarks. Remarkably, JanusDNA surpasses models with 250x more activated parameters, underscoring its efficiency and effectiveness. Code available at https://anonymous.4open.science/r/JanusDNA/.
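The abstract's efficiency argument, that autoregressive training supervises nearly every position while masked training supervises only the masked ones, can be illustrated with a toy calculation. This is purely illustrative, not JanusDNA code; the function name and the 15% mask rate are assumptions.

```python
# Toy illustration (not JanusDNA code): what fraction of sequence positions
# produce a training signal under each pretraining objective?

def loss_token_fraction(seq_len, objective, mask_rate=0.15):
    """Fraction of positions that contribute to the loss in one step."""
    if objective == "autoregressive":
        # Every position after the first is predicted from its left context.
        return (seq_len - 1) / seq_len
    if objective == "masked":
        # Only masked positions are predicted; the rest serve as context.
        return mask_rate
    raise ValueError(f"unknown objective: {objective}")

for obj in ("autoregressive", "masked"):
    print(f"{obj}: {loss_token_fraction(10_000, obj):.3f}")
```

The gap (roughly 1.0 vs. 0.15 of positions supervised per step) is the inefficiency of masked training that a combined objective, as described in the abstract, aims to close while keeping bidirectional context.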

AAAI Conference 2025 Conference Paper

Representation Learning Based Predicate Invention on Knowledge Graphs

  • Man Zhu
  • Pengfei Huang
  • Lei Gu
  • Xiaolong Xu
  • Jingyu Han

Recognizing whether a predicate should be invented is an important problem in the domain of predicate invention. Despite its significance, existing research has yet to fully harness the rich data available in knowledge graphs. In this paper, we introduce a novel problem formulation, ReLPI (Representation Learning for Predicate Invention in Knowledge Graphs), marking a pioneering effort in this domain. To address the core issues of ReLPI, we devise a scoring function that informs the learning process. By optimizing embeddings toward this scoring function, we endow them with semantic meaning, which is crucial for capturing the nuances of predicate presence patterns. Furthermore, we present SEmPI (Semantic Embeddings for Predicate Invention), a framework that leverages predicate (relation) embeddings as a trainable medium. SEmPI uncovers latent patterns governing predicate occurrences in knowledge graphs, enabling the invention of novel predicates grounded in these discovered patterns. This approach represents a significant step forward in leveraging data-driven methods for predicate invention in knowledge graphs. We evaluate the proposed approach on the FB15k and DRKG datasets, and the results demonstrate the effectiveness of SEmPI in discovering new predicates.
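As a concrete example of scoring triples with relation embeddings, the sketch below uses a translational, TransE-style distance. The abstract does not specify SEmPI's actual scoring function, so the `score` function and the toy embeddings here are assumptions for illustration only.

```python
# Hypothetical sketch: a translational (TransE-style) scoring function over
# entity and relation embeddings. Lower scores mean the relation embedding r
# better "explains" the observed (head, tail) pair.

import math

def score(h, r, t):
    """Distance ||h + r - t||; lower is better."""
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

# Toy 2-d embeddings: r_good translates h almost exactly onto t.
h = [0.1, 0.2]
t = [0.6, 0.9]
r_good = [0.5, 0.7]
r_bad = [0.0, 0.0]

assert score(h, r_good, t) < score(h, r_bad, t)
```

Under a learned scoring function of this general shape, a candidate (invented) predicate could be ranked by how well a single relation embedding accounts for the entity pairs it would connect.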

AAAI Conference 2018 Conference Paper

An Euclidean Distance Based on Tensor Product Graph Diffusion Related Attribute Value Embedding for Nominal Data Clustering

  • Lei Gu
  • Ningning Zhou
  • Yang Zhao

Unlike numerical data clustering, nominal data clustering is a difficult problem because there is no natural relative ordering between nominal attribute values. This paper aims to make the Euclidean distance measure appropriate for nominal data clustering; the core idea is attribute value embedding, namely, transforming each nominal attribute value into a numerical vector. The embedding method consists of four steps. In the first step, weights that quantify the amount of information in attribute values are calculated for each value of each nominal attribute, based on each object and its k nearest neighbors. In the second step, an intra-attribute value similarity matrix is created for each nominal attribute using the attribute value weights. In the third step, for each nominal attribute, we find the attribute with the maximal dependence on it and build an inter-attribute value similarity matrix based on the value weights of these two attributes. In the last step, a diffusion matrix for each nominal attribute is constructed via the tensor product graph diffusion process; this step ensures that the resulting value embeddings simultaneously capture both intra- and inter-attribute value similarity information. To evaluate the effectiveness of the proposed method, experiments are conducted on 10 datasets. Experimental results demonstrate that our method not only enables the Euclidean distance to be used for nominal data clustering, but also achieves better clustering performance than several existing state-of-the-art approaches.
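The final diffusion step can be sketched with the common iterative form S ← A S Aᵀ + I of tensor product graph diffusion. This is a minimal illustration under assumptions (a toy 3x3 similarity matrix with spectral radius below 1 so the iteration converges, a fixed iteration count), not the paper's full four-step pipeline.

```python
# Minimal sketch of tensor product graph diffusion on a toy
# intra-attribute value similarity matrix A (pure-Python, dense).

def matmul(X, Y):
    rows, inner, cols = len(X), len(Y), len(Y[0])
    return [[sum(X[i][k] * Y[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]

def transpose(X):
    return [list(row) for row in zip(*X)]

def tpg_diffusion(A, iters=20):
    """Iterate S <- A S A^T + I starting from S = A."""
    n = len(A)
    I = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    S = [row[:] for row in A]
    for _ in range(iters):
        S = matmul(matmul(A, S), transpose(A))
        S = [[S[i][j] + I[i][j] for j in range(n)] for i in range(n)]
    return S

# Toy symmetric similarity matrix between three values of one nominal
# attribute; row sums below 1 keep the iteration convergent.
A = [[0.5, 0.3, 0.1],
     [0.3, 0.4, 0.2],
     [0.1, 0.2, 0.6]]
S = tpg_diffusion(A)
# Each row of S can then serve as the numerical embedding of one value,
# making Euclidean distance meaningful between nominal values.
```

Because A is symmetric, the diffused matrix S stays symmetric, and its rows provide the numerical vectors the embedding step requires.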