
Author name cluster

Yuzhi Guo

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

5 papers
1 author row

Possible papers (5)

AAAI Conference 2026 · Conference Paper

GenePheno: Interpretable Gene Knockout-Induced Phenotype Abnormality Prediction from Gene Sequences

  • Jingquan Yan
  • Yuwei Miao
  • Lei Yu
  • Yuzhi Guo
  • Xue Xiao
  • Lin Xu
  • Junzhou Huang

Exploring how genetic sequences shape phenotypes is a fundamental challenge in biology and a key step toward scalable, hypothesis-driven experimentation. The task is complicated by the large modality gap between sequences and phenotypes, as well as the pleiotropic nature of gene–phenotype relationships. Existing sequence-based efforts focus on the degree to which variants of specific genes alter a limited set of phenotypes, while general gene knockout-induced phenotype abnormality prediction methods heavily rely on curated genetic information as inputs, which limits scalability and generalizability. As a result, the task of broadly predicting the presence of multiple phenotype abnormalities under gene knockout directly from gene sequences remains underexplored. We introduce GenePheno, the first interpretable multi-label prediction framework that predicts knockout-induced phenotypic abnormalities from gene sequences. GenePheno employs a contrastive multi-label learning objective that captures inter-phenotype correlations, complemented by an exclusive regularization that enforces biological consistency. It further incorporates a gene function bottleneck layer, offering human-interpretable concepts that reflect functional mechanisms behind phenotype formation. To support progress in this area, we curate four datasets with canonical gene sequences as input and multi-label phenotypic abnormalities induced by gene knockouts as targets. Across these datasets, GenePheno achieves state-of-the-art gene-centric Fmax and phenotype-centric AUC, and case studies demonstrate its ability to reveal gene functional mechanisms.
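
As a rough illustration of the bottleneck idea described in the abstract, the sketch below maps a pooled gene embedding to interpretable function-concept activations before producing multi-label phenotype logits. The layer sizes, names, and the plain BCE loss are illustrative assumptions, not GenePheno's actual architecture or its contrastive and exclusive-regularization objectives.

```python
import torch
import torch.nn as nn

class ConceptBottleneckClassifier(nn.Module):
    """Toy multi-label classifier with an interpretable bottleneck layer.

    A pooled gene-sequence embedding is mapped to scores over human-readable
    "function concepts"; a second linear layer maps those concepts to
    phenotype-abnormality logits.
    """

    def __init__(self, embed_dim: int, n_concepts: int, n_phenotypes: int):
        super().__init__()
        self.to_concepts = nn.Linear(embed_dim, n_concepts)      # interpretable bottleneck
        self.to_phenotypes = nn.Linear(n_concepts, n_phenotypes)

    def forward(self, gene_embedding: torch.Tensor):
        concepts = torch.sigmoid(self.to_concepts(gene_embedding))  # concept activations in [0, 1]
        logits = self.to_phenotypes(concepts)                       # phenotype logits
        return logits, concepts

# Illustrative usage with random data (not the paper's datasets).
model = ConceptBottleneckClassifier(embed_dim=128, n_concepts=32, n_phenotypes=10)
x = torch.randn(4, 128)                   # pooled gene-sequence embeddings
y = torch.randint(0, 2, (4, 10)).float()  # multi-label phenotype targets
logits, concepts = model(x)
loss = nn.functional.binary_cross_entropy_with_logits(logits, y)
loss.backward()
```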

AAAI Conference 2026 · Conference Paper

Learning from Guidelines: Structured Prompt Optimization for Expert Annotation Tasks

  • Wenliang Zhong
  • Haiqing Li
  • Thao M. Dang
  • Feng Jiang
  • Hehuan Ma
  • Yuzhi Guo
  • Jean Gao
  • Junzhou Huang

Deep learning has significantly advanced numerous fields by training on extensive annotated datasets. However, this data-driven paradigm faces limitations such as poor adaptability and high annotation costs, particularly when precise adherence to detailed, domain-specific guidelines is required during annotation. This challenge raises a critical question: Can models effectively shift from data-driven learning to autonomously leveraging guidelines with minimal annotated examples? To address this, we propose the Guideline-Driven Prompt (GDP) optimization framework, which shifts the learning paradigm from data-driven training to guideline-driven reasoning. GDP leverages Retrieval Augmented Generation (RAG) to retrieve essential fragments from complex guidelines and synthesize them into structured, executable prompts. A tree-based optimization algorithm systematically constructs and refines these prompts, explicitly capturing the intricate logic embedded in professional guidelines through a latent pipeline structure. Empirical evaluations on four datasets spanning diverse domains and tasks demonstrate that GDP effectively transitions the learning process from data-intensive methods to a guideline-driven approach in tasks requiring detailed and complex guideline adherence, reducing dependence on extensive annotated datasets.
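
To make the retrieve-then-assemble step concrete, the following sketch pairs a deliberately naive lexical retriever with a hand-written prompt template. The retriever, template, and guideline text are placeholders; the paper's RAG component and tree-based prompt optimizer are not reproduced here.

```python
from collections import Counter

def overlap_score(query: str, fragment: str) -> float:
    """Crude lexical retriever: fraction of query tokens also found in the fragment."""
    q, f = Counter(query.lower().split()), Counter(fragment.lower().split())
    return sum((q & f).values()) / max(sum(q.values()), 1)

def build_prompt(task: str, guideline_fragments: list, example: str, k: int = 2) -> str:
    """Retrieve the k most relevant guideline fragments and slot them into a structured prompt."""
    ranked = sorted(guideline_fragments, key=lambda frag: overlap_score(task, frag), reverse=True)
    rules = "\n".join(f"- {frag}" for frag in ranked[:k])
    return (
        f"Task: {task}\n"
        f"Relevant guideline excerpts:\n{rules}\n"
        f"Annotate the following item accordingly:\n{example}\n"
        "Answer with the label and a one-sentence justification."
    )

# Toy usage with made-up guideline text.
guidelines = [
    "Label a report as 'urgent' only if it mentions an immediate safety risk.",
    "Spelling errors alone never change the label.",
    "When two rules conflict, prefer the more specific one.",
]
print(build_prompt("Decide whether this incident report is urgent",
                   guidelines, "Report: the valve is leaking near the main walkway."))
```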

AAAI Conference 2025 · Conference Paper

GoBERT: Gene Ontology Graph Informed BERT for Universal Gene Function Prediction

  • Yuwei Miao
  • Yuzhi Guo
  • Hehuan Ma
  • Jingquan Yan
  • Feng Jiang
  • Rui Liao
  • Junzhou Huang

Exploring the functions of genes and gene products is crucial to a wide range of fields, including medical research, evolutionary biology, and environmental science. However, discovering new functions largely relies on expensive and exhaustive wet-lab experiments. Existing methods of automatic function annotation or prediction mainly focus on protein function prediction using sequences, 3D structures, or protein family information. In this study, we propose to tackle the gene function prediction problem by exploring the Gene Ontology graph and annotations with BERT (GoBERT) to decipher the underlying relationships among gene functions. Our proposed novel function prediction task utilizes existing functions as inputs and generalizes function prediction to genes and gene products. Specifically, two pre-training tasks are designed to jointly train GoBERT to capture both explicit and implicit relations among functions. Neighborhood prediction is a self-supervised multi-label classification task that captures the explicit function relations. A specialized masking-and-recovering task helps GoBERT find implicit patterns among functions. The pre-trained GoBERT possesses the ability to predict novel functions for various genes and gene products based on known functional annotations. Extensive experiments, biological case studies, and ablation studies are conducted to demonstrate the superiority of our proposed GoBERT.
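
The masking-and-recovering pre-training task can be pictured with a small sketch that corrupts a gene's set of GO-term tokens and keeps recovery labels only for the masked positions, as in standard masked-language-model training. The vocabulary, mask id, and masking rate below are assumptions for illustration only, not GoBERT's actual setup.

```python
import torch

MASK_ID = 0  # reserved id for the [MASK] token in this toy vocabulary

def mask_function_tokens(go_term_ids: torch.Tensor, mask_prob: float = 0.15):
    """Randomly replace a fraction of GO-term tokens with [MASK].

    Returns the corrupted input and the recovery labels; -100 marks positions
    that should be ignored by the loss, following the usual MLM convention.
    """
    labels = go_term_ids.clone()
    mask = torch.rand(go_term_ids.shape) < mask_prob
    corrupted = go_term_ids.clone()
    corrupted[mask] = MASK_ID
    labels[~mask] = -100  # only masked positions contribute to the loss
    return corrupted, labels

# Toy example: one gene annotated with five (fake) GO-term token ids.
annotations = torch.tensor([[17, 42, 8, 93, 55]])
inputs, labels = mask_function_tokens(annotations, mask_prob=0.4)
print(inputs, labels)
```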

NeurIPS Conference 2025 · Conference Paper

TRIDENT: Tri-Modal Molecular Representation Learning with Taxonomic Annotations and Local Correspondence

  • Feng Jiang
  • Mangal Prakash
  • Hehuan Ma
  • Jianyuan Deng
  • Yuzhi Guo
  • Maolaaisha Aminanmu
  • Tommaso Mansi
  • Rui Liao

Molecular property prediction aims to learn representations that map chemical structures to functional properties. While multimodal learning has emerged as a powerful paradigm to learn molecular representations, prior works have largely overlooked textual and taxonomic information of molecules for representation learning. We introduce TRIDENT, a novel framework that integrates molecular SMILES, textual descriptions, and taxonomic functional annotations to learn rich molecular representations. To achieve this, we curate a comprehensive dataset of molecule-text pairs with structured, multi-level functional annotations. Instead of relying on conventional contrastive loss, TRIDENT employs a volume-based alignment objective to jointly align tri-modal features at the global level, enabling soft, geometry-aware alignment across modalities. Additionally, TRIDENT introduces a novel local alignment objective that captures detailed relationships between molecular substructures and their corresponding sub-textual descriptions. A momentum-based mechanism dynamically balances global and local alignment, enabling the model to learn both broad functional semantics and fine-grained structure-function mappings. TRIDENT achieves state-of-the-art performance on 18 downstream tasks, demonstrating the value of combining SMILES, textual, and taxonomic functional annotations for molecular property prediction. Our code and data are available at https://github.com/uta-smile/TRIDENT.
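
One simple way to realize a momentum-based balance between a global and a local alignment term is to track an exponential moving average of each loss's magnitude and rescale the terms before summing. This is only a plausible sketch under that assumption; TRIDENT's volume-based global objective and its exact balancing rule are not reproduced here.

```python
import torch

class MomentumLossBalancer:
    """Keep an exponential moving average of two loss terms and weight each
    term by the inverse of its running magnitude, so neither the global nor
    the local alignment objective dominates training.
    """

    def __init__(self, momentum: float = 0.99):
        self.momentum = momentum
        self.ema_global = None
        self.ema_local = None

    def combine(self, loss_global: torch.Tensor, loss_local: torch.Tensor) -> torch.Tensor:
        g, l = loss_global.detach(), loss_local.detach()
        if self.ema_global is None:
            self.ema_global, self.ema_local = g, l
        else:
            m = self.momentum
            self.ema_global = m * self.ema_global + (1 - m) * g
            self.ema_local = m * self.ema_local + (1 - m) * l
        # Normalize each term by its running scale before summing.
        return loss_global / (self.ema_global + 1e-8) + loss_local / (self.ema_local + 1e-8)

# Toy usage with placeholder loss values.
balancer = MomentumLossBalancer()
total = balancer.combine(torch.tensor(2.3, requires_grad=True),
                         torch.tensor(0.4, requires_grad=True))
total.backward()
```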

AAAI Conference 2022 · Conference Paper

Self-Supervised Pre-training for Protein Embeddings Using Tertiary Structures

  • Yuzhi Guo
  • Jiaxiang Wu
  • Hehuan Ma
  • Junzhou Huang

A protein's tertiary structure largely determines its interactions with other molecules. Despite its importance in various structure-related tasks, fully supervised data are often time-consuming and costly to obtain. Existing pre-training models mostly focus on amino-acid sequences or multiple sequence alignments, while structural information is not yet exploited. In this paper, we propose a self-supervised pre-training model for learning structure embeddings from protein tertiary structures. Native protein structures are perturbed with random noise, and the pre-training model aims at estimating gradients over the perturbed 3D structures. Specifically, we adopt SE(3)-invariant features as the model inputs and reconstruct gradients over 3D coordinates with SE(3)-equivariance preserved. Such a paradigm avoids the use of sophisticated SE(3)-equivariant models and dramatically improves the computational efficiency of pre-training models. We demonstrate the effectiveness of our pre-training model on two downstream tasks, protein structure quality assessment (QA) and protein-protein interaction (PPI) site prediction. Hierarchical structure embeddings are extracted to enhance the corresponding prediction models. Extensive experiments indicate that such structure embeddings consistently improve the prediction accuracy for both downstream tasks.
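
The perturb-and-denoise objective can be sketched as follows: Gaussian noise is added to native coordinates and a small network is trained to predict the per-atom displacement, which for a Gaussian perturbation is proportional to the score (gradient of the log-density) at the noisy structure. The toy model below acts directly on raw coordinates and omits the paper's SE(3)-invariant feature construction; all sizes and the noise scale are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Toy denoising objective: perturb placeholder "native" CA coordinates with
# Gaussian noise and train a small network to predict the displacement back
# toward the native structure.
model = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 3))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
sigma = 0.5  # noise scale in the same units as the coordinates

native = torch.randn(128, 3)  # placeholder structure with 128 atoms
for step in range(100):
    noise = sigma * torch.randn_like(native)
    perturbed = native + noise
    predicted = model(perturbed)              # predicted per-atom displacement
    loss = ((predicted - (-noise)) ** 2).mean()  # target is the negative noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```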