Arrow Research search

Author name cluster

Keyan Ding

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

9 papers
2 author rows

Possible papers


AAAI Conference 2026 Conference Paper

Breaking the Modality Barrier: Generative Modeling for Accurate Molecule Retrieval from Mass Spectra

  • Yiwen Zhang
  • Keyan Ding
  • Yihang Wu
  • Xiang Zhuang
  • Yi Yang
  • Qiang Zhang
  • Huajun Chen

Retrieving molecular structures from tandem mass spectra is a crucial step in rapid compound identification. Existing retrieval methods, such as traditional mass spectral library matching, suffer from limited spectral library coverage, while recent cross-modal representation learning frameworks often encounter modality misalignment, resulting in suboptimal retrieval accuracy and generalization. To address these limitations, we propose GLMR, a Generative Language Model-based Retrieval framework that mitigates the cross-modal misalignment through a two-stage process. In the pre-retrieval stage, a contrastive learning-based model identifies top candidate molecules as contextual priors for the input mass spectrum. In the generative retrieval stage, these candidate molecules are integrated with the input mass spectrum to guide a generative model in producing refined molecular structures, which are then used to re-rank the candidates based on molecular similarity. Experiments on both MassSpecGym and the proposed MassRET-20k dataset demonstrate that GLMR significantly outperforms existing methods, achieving over 40% improvement in top-1 accuracy and exhibiting strong generalizability.
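The two-stage retrieve-then-rerank idea described in the abstract can be sketched in a few lines. This is an illustrative toy (cosine similarity over made-up embeddings), not the authors' GLMR code; `pre_retrieve` and `rerank` are hypothetical names:

```python
import numpy as np

def pre_retrieve(spectrum_emb, library_embs, k=3):
    """Stage 1: pick top-k candidate molecules for a spectrum
    by cosine similarity (stands in for the contrastive model)."""
    sims = library_embs @ spectrum_emb / (
        np.linalg.norm(library_embs, axis=1) * np.linalg.norm(spectrum_emb))
    return np.argsort(-sims)[:k]

def rerank(candidate_ids, candidate_embs, generated_emb):
    """Stage 2: re-rank candidates by similarity to a structure
    produced by the generative model."""
    sims = candidate_embs @ generated_emb
    order = np.argsort(-sims)
    return [candidate_ids[i] for i in order]
```

In the paper the second-stage similarity is molecular (between generated and candidate structures); here a dot product over toy vectors plays that role.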

NeurIPS Conference 2025 Conference Paper

HiMoLE: Towards OOD-Robust LoRA via Hierarchical Mixture of Experts

  • Yinuo Jiang
  • Yan Xiaodong
  • Keyan Ding
  • Deng Zhao
  • Lei Liang
  • Qiang Zhang
  • Huajun Chen

Parameter-efficient fine-tuning (PEFT) methods, such as LoRA, have enabled the efficient adaptation of large language models (LLMs) by updating only a small subset of parameters. However, their robustness under out-of-distribution (OOD) conditions remains insufficiently studied. In this paper, we identify the limitations of conventional LoRA in handling distributional shifts and propose HiMoLE (Hierarchical Mixture of LoRA Experts), a new framework designed to improve OOD generalization. HiMoLE integrates hierarchical expert modules and hierarchical routing strategies into the LoRA architecture and introduces a two-phase training procedure enhanced by a diversity-driven loss. This design mitigates negative transfer and promotes effective knowledge adaptation across diverse data distributions. We evaluate HiMoLE on three representative tasks in natural language processing. Experimental results show that HiMoLE consistently outperforms existing LoRA-based approaches, significantly reducing performance degradation on OOD data while improving in-distribution performance. Our work bridges the gap between parameter efficiency and distributional robustness, advancing the practical deployment of LLMs in real-world applications.
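As a rough illustration of a mixture of LoRA experts (a flat router for simplicity, not the paper's hierarchical routing; all shapes and names here are invented): the forward pass adds a router-weighted sum of low-rank updates to the frozen base weight.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n_experts = 8, 2, 3

W = rng.normal(size=(d, d))               # frozen base weight
A = rng.normal(size=(n_experts, r, d))    # per-expert LoRA down-projections
B = np.zeros((n_experts, d, r))           # per-expert up-projections (zero-init)
gate_W = rng.normal(size=(n_experts, d))  # router

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def mole_forward(x):
    """Base output plus a router-weighted sum of LoRA expert updates."""
    gates = softmax(gate_W @ x)
    delta = sum(g * (B[e] @ (A[e] @ x)) for e, g in enumerate(gates))
    return W @ x + delta
```

With the conventional zero initialization of `B`, the layer starts out identical to the frozen base layer, which is the usual LoRA warm-start property.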

JBHI Journal 2025 Journal Article

MPSol: A Multimodal Prompt Learning Framework for Protein Solubility Prediction

  • Yuhang Zhang
  • Peilin Chen
  • Keyan Ding
  • Han Liu
  • Shiqi Wang
  • Qi Song

Protein solubility is a critical determinant of biologic candidates’ developability, stability, and therapeutic efficacy. However, accurate solubility prediction remains a central challenge in computational protein engineering due to the inherent complexity within protein sequences. In this work, we propose a multimodal prompt learning framework, called MPSol, for protein solubility prediction that integrates complementary representations derived from primary sequences, structural proxies, and textual descriptions generated by large language models (LLMs). MPSol is built upon a unified multimodal backbone with a dedicated cross-modal fusion module that captures fine-grained interactions across modalities. In addition, we design label-aware prompts that encode solubility-specific semantic cues associated with each class. These prompts provide semantic supervision, guiding the alignment of fused protein representations to promote semantic consistency. Extensive experiments demonstrate that MPSol achieves state-of-the-art performance, reaching an accuracy of 0.815, AUC of 0.867 and MCC of 0.642 on the standard PDBSol test set, and generalizes well to the external out-of-distribution test dataset with an accuracy of 0.632, AUC of 0.653 and MCC of 0.332. These results underscore the potential of prompt-driven multimodal learning for interpretable and effective protein property prediction.
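The label-aware prompt idea — scoring a fused protein embedding against per-class prompt embeddings — can be illustrated as follows. Everything here is hypothetical (names, shapes, the cosine-similarity readout), not the MPSol implementation:

```python
import numpy as np

def prompt_classify(fused_emb, label_prompts):
    """Classify by cosine similarity between a fused protein embedding
    and label-aware prompt embeddings (e.g. soluble vs. insoluble)."""
    def norm(v):
        return v / np.linalg.norm(v)
    sims = np.array([norm(fused_emb) @ norm(p) for p in label_prompts])
    return int(np.argmax(sims)), sims
```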

ICLR Conference 2025 Conference Paper

SaMer: A Scenario-aware Multi-dimensional Evaluator for Large Language Models

  • Kehua Feng
  • Keyan Ding
  • Jing Yu
  • Yiwen Qu
  • Zhiwen Chen 0002
  • Chengfei Lv
  • Gang Yu
  • Qiang Zhang 0026

Evaluating the response quality of large language models (LLMs) for open-ended questions poses a significant challenge, especially given the subjectivity and multi-dimensionality of "quality" in natural language generation. Existing LLM evaluators often neglect that different scenarios require distinct evaluation criteria. In this work, we propose SaMer, a scenario-aware multi-dimensional evaluator designed to provide both overall and fine-grained assessments of LLM-generated responses. Unlike fixed-dimension evaluation approaches, SaMer adapts to different scenarios by automatically identifying and prioritizing relevant evaluation dimensions tailored to the given query. To achieve this, we construct a large-scale fine-grained preference dataset spanning multiple real-world scenarios, each with distinct evaluation dimensions. We then leverage a text embedding model combined with three specialized heads to predict the appropriate evaluation dimensions and corresponding scores, as well as the respective weights that contribute to the overall score. The resulting model offers fine-grained and interpretable evaluations and shows robust adaptability across diverse scenarios. Extensive experiments on eight single rating and pairwise comparison datasets demonstrate that SaMer outperforms existing baselines in a variety of evaluation tasks, showcasing its robustness, versatility, and generalizability.
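The three-head design (dimension relevance, per-dimension scores, and mixing weights over a shared embedding) can be sketched as below. The heads are plain linear maps here and every name is illustrative, not the released SaMer model:

```python
import numpy as np

def samer_style_score(emb, dim_head, score_head, weight_head, threshold=0.5):
    """Combine three heads over one text embedding:
    which dimensions apply, their scores, and their mixing weights."""
    relevance = 1 / (1 + np.exp(-(dim_head @ emb)))  # sigmoid per dimension
    scores = score_head @ emb                        # per-dimension scores
    logits = weight_head @ emb                       # raw mixing weights
    mask = relevance > threshold
    if not mask.any():
        mask[:] = True                               # fall back to all dims
    w = np.exp(logits - logits.max())
    w = np.where(mask, w, 0.0)
    w = w / w.sum()                                  # softmax over active dims
    overall = float(w @ scores)
    return mask, scores, overall
```

The overall score is a weighted sum of the per-dimension scores, restricted to the dimensions the relevance head selects for the given query.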

NeurIPS Conference 2024 Conference Paper

DePLM: Denoising Protein Language Models for Property Optimization

  • Zeyuan Wang
  • Keyan Ding
  • Ming Qin
  • Xiaotong Li
  • Xiang Zhuang
  • Yu Zhao
  • Jianhua Yao
  • Qiang Zhang

Protein optimization is a fundamental biological task aimed at enhancing the performance of proteins by modifying their sequences. Computational methods primarily rely on evolutionary information (EI) encoded by protein language models (PLMs) to predict the fitness landscape for optimization. However, these methods suffer from a few limitations. (1) Evolutionary processes involve the simultaneous consideration of multiple functional properties, often overshadowing the specific property of interest. (2) Measurements of these properties tend to be tailored to experimental conditions, leading to reduced generalizability of trained models to novel proteins. To address these limitations, we introduce Denoising Protein Language Models (DePLM), a novel approach that refines the evolutionary information embodied in PLMs for improved protein optimization. Specifically, we conceptualize EI as comprising both property-relevant and irrelevant information, with the latter acting as “noise” for the optimization task at hand. Our approach involves denoising this EI in PLMs through a diffusion process conducted in the rank space of property values, thereby enhancing model generalization and ensuring dataset-agnostic learning. Extensive experimental results have demonstrated that DePLM not only surpasses the state-of-the-art in mutation effect prediction but also exhibits strong generalization capabilities for novel proteins.
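Only the rank-space mapping mentioned in the abstract is simple enough to sketch here; the diffusion process itself is omitted. `to_ranks` is a hypothetical helper showing what "the rank space of property values" means — each measured value is replaced by its rank, which is what makes learning dataset-agnostic:

```python
import numpy as np

def to_ranks(values):
    """Map property values to rank space (0 = lowest value).
    Ranks are invariant to any monotone rescaling of the assay."""
    order = np.argsort(values)
    ranks = np.empty_like(order)
    ranks[order] = np.arange(len(values))
    return ranks
```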

ICML Conference 2024 Conference Paper

Knowledge-aware Reinforced Language Models for Protein Directed Evolution

  • Yuhao Wang
  • Qiang Zhang 0026
  • Ming Qin
  • Xiang Zhuang
  • Xiaotong Li
  • Zhichen Gong
  • Zeyuan Wang
  • Yu Zhao 0009

Directed evolution, a cornerstone of protein optimization, harnesses natural mutational processes to enhance protein functionality. Existing Machine Learning-assisted Directed Evolution (MLDE) methodologies typically rely on data-driven strategies and often overlook the profound domain knowledge in biochemical fields. In this paper, we introduce a novel Knowledge-aware Reinforced Language Model (KnowRLM) for MLDE. An Amino Acid Knowledge Graph (AAKG) is constructed to represent the intricate biochemical relationships among amino acids. We further propose a Protein Language Model (PLM)-based policy network that iteratively samples mutants through preferential random walks on the AAKG using a dynamic sliding window mechanism. The novel mutants are actively sampled to fine-tune a fitness predictor as the reward model, providing feedback to the knowledge-aware policy. Finally, we optimize the whole system in an active learning approach that mimics biological settings in practice. KnowRLM stands out for its ability to utilize contextual amino acid information from knowledge graphs, thus attaining advantages from both statistical patterns of protein sequences and biochemical properties of amino acids. Extensive experiments demonstrate the superior performance of KnowRLM in more efficiently identifying high-fitness mutants compared to existing methods.
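A preferential random walk over a weighted graph — the sampling primitive the abstract describes, minus the PLM policy and the sliding window — might look like this. The graph layout and names are hypothetical, not the paper's AAKG:

```python
import random

def preferential_walk(graph, start, steps, rng=None):
    """Walk a weighted graph, drawing each next node in proportion
    to edge weight (e.g. a biochemical affinity between amino acids)."""
    rng = rng or random.Random(0)
    path = [start]
    node = start
    for _ in range(steps):
        nbrs = graph[node]                 # dict: neighbor -> weight
        r = rng.random() * sum(nbrs.values())
        for nxt, w in nbrs.items():        # roulette-wheel selection
            r -= w
            if r <= 0:
                node = nxt
                break
        path.append(node)
    return path
```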

ECAI Conference 2023 Conference Paper

Active Finetuning Protein Language Model: A Budget-Friendly Method for Directed Evolution

  • Ming Qin
  • Keyan Ding
  • Bin Wu 0025
  • Zhenping Li
  • Haihong Yang
  • Zeyuan Wang
  • Hongbin Ye
  • Haoran Yu

Directed evolution is a widely used strategy of protein engineering to improve protein function by mimicking natural mutation and selection. Machine learning-assisted directed evolution (MLDE) approaches aim to learn a fitness predictor, thereby efficiently searching for optimal mutants within the vast combinatorial mutation space. Since annotating mutants is both costly and labor-intensive, how to efficiently sample and utilize informative protein mutants to train the predictor is a critical problem in MLDE. Previous MLDE works simply utilized pre-trained protein language models (PPLMs) for sampling without tailoring them to the specific target protein of interest, which has not fully exploited the potential of PPLMs. In this work, we propose a novel method, the Actively-Finetuned Protein language model for Directed Evolution (AFP-DE), which leverages PPLMs to actively sample and fine-tune themselves, continuously improving the model’s sampling and overall performance through iterations, to achieve efficient directed protein evolution. Extensive experiments have shown the effectiveness of our method in generating optimal mutants with minimal annotation effort, outperforming previous works even with fewer annotated mutants, making it budget-friendly for biological experiments.
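One round of the iterative sample-annotate-finetune loop can be sketched with a greedy acquisition over model scores. This is a simplification (the paper's sampling is PPLM-driven), and every name here is hypothetical:

```python
import numpy as np

def active_learning_round(pool_scores, labeled_idx, budget):
    """Pick the top-`budget` not-yet-annotated mutants by model score;
    in practice these would then be annotated and used for fine-tuning."""
    mask = np.ones(len(pool_scores), dtype=bool)
    mask[list(labeled_idx)] = False        # exclude already-annotated mutants
    candidates = np.where(mask)[0]
    order = candidates[np.argsort(-pool_scores[candidates])]
    return list(order[:budget])
```

The annotation budget is the fixed cost per iteration, which is what makes the overall procedure "budget-friendly".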

IJCAI Conference 2023 Conference Paper

Graph Sampling-based Meta-Learning for Molecular Property Prediction

  • Xiang Zhuang
  • Qiang Zhang
  • Bin Wu
  • Keyan Ding
  • Yin Fang
  • Huajun Chen

Molecular properties are usually observed with a limited number of samples, and researchers have considered property prediction as a few-shot problem. One important fact that has been ignored by prior works is that each molecule can be recorded with several different properties simultaneously. To effectively utilize many-to-many correlations of molecules and properties, we propose a Graph Sampling-based Meta-learning (GS-Meta) framework for few-shot molecular property prediction. First, we construct a Molecule-Property relation Graph (MPG): molecules and properties are nodes, while property labels decide edges. Then, to utilize the topological information of MPG, we reformulate an episode in meta-learning as a subgraph of the MPG, containing a target property node, molecule nodes, and auxiliary property nodes. Third, as episodes in the form of subgraphs are no longer independent of each other, we propose to schedule the subgraph sampling process with a contrastive loss function, which considers the consistency and discrimination of subgraphs. Extensive experiments on 5 commonly-used benchmarks show GS-Meta consistently outperforms state-of-the-art methods by 5.71%-6.93% in ROC-AUC and verify the effectiveness of each proposed module. Our code is available at https://github.com/HICAI-ZJU/GS-Meta.
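An episode-as-subgraph sampler can be sketched from a molecule-by-property label matrix (NaN marks an unrecorded label). This is an illustrative reading of the abstract, not the released GS-Meta code, and all names are hypothetical:

```python
import numpy as np

def sample_episode(labels, target_prop, n_aux, rng):
    """Build one episode subgraph: the target property node, the
    molecule nodes labeled for it, and n_aux auxiliary property nodes."""
    molecules = np.where(~np.isnan(labels[:, target_prop]))[0]
    other = [p for p in range(labels.shape[1]) if p != target_prop]
    aux = rng.choice(other, size=n_aux, replace=False)
    return {"target": target_prop,
            "molecules": molecules.tolist(),
            "aux_props": sorted(aux.tolist())}
```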

NeurIPS Conference 2023 Conference Paper

Learning Invariant Molecular Representation in Latent Discrete Space

  • Xiang Zhuang
  • Qiang Zhang
  • Keyan Ding
  • Yatao Bian
  • Xiao Wang
  • Jingsong Lv
  • Hongyang Chen
  • Huajun Chen

Molecular representation learning lays the foundation for drug discovery. However, existing methods suffer from poor out-of-distribution (OOD) generalization, particularly when data for training and testing originate from different environments. To address this issue, we propose a new framework for learning molecular representations that exhibit invariance and robustness against distribution shifts. Specifically, we propose a strategy called "first-encoding-then-separation" to identify invariant molecule features in the latent space, which deviates from conventional practices. Prior to the separation step, we introduce a residual vector quantization module that mitigates the over-fitting to training data distributions while preserving the expressivity of encoders. Furthermore, we design a task-agnostic self-supervised learning objective to encourage precise invariance identification, which makes our method widely applicable to a variety of tasks, such as regression and multi-label classification. Extensive experiments on 18 real-world molecular datasets demonstrate that our model achieves stronger generalization against state-of-the-art baselines in the presence of various distribution shifts. Our code is available at https://github.com/HICAI-ZJU/iMoLD.
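Residual vector quantization itself is a standard technique: each stage quantizes what the previous stages left over, and the selected codes are summed. A minimal sketch, independent of the iMoLD codebase:

```python
import numpy as np

def residual_vq(z, codebooks):
    """Quantize z with a cascade of codebooks: at each stage, pick the
    nearest code to the current residual and add it to the reconstruction."""
    quantized = np.zeros_like(z)
    residual = z.copy()
    codes = []
    for cb in codebooks:                       # cb: (n_codes, dim)
        idx = int(np.argmin(((cb - residual) ** 2).sum(axis=1)))
        quantized += cb[idx]
        residual = z - quantized               # what later stages must explain
        codes.append(idx)
    return quantized, codes
```

Because later stages only model the leftover error, a short cascade of small codebooks can approximate vectors far more finely than a single codebook of the same total size.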