Arrow Research search

Author name cluster

Keyan Ding

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

9 papers
2 author rows

Possible papers


AAAI Conference 2026 Conference Paper

Breaking the Modality Barrier: Generative Modeling for Accurate Molecule Retrieval from Mass Spectra

  • Yiwen Zhang
  • Keyan Ding
  • Yihang Wu
  • Xiang Zhuang
  • Yi Yang
  • Qiang Zhang
  • Huajun Chen

Retrieving molecular structures from tandem mass spectra is a crucial step in rapid compound identification. Existing retrieval methods, such as traditional mass spectral library matching, suffer from limited spectral library coverage, while recent cross-modal representation learning frameworks often encounter modality misalignment, resulting in suboptimal retrieval accuracy and generalization. To address these limitations, we propose GLMR, a Generative Language Model-based Retrieval framework that mitigates the cross-modal misalignment through a two-stage process. In the pre-retrieval stage, a contrastive learning-based model identifies top candidate molecules as contextual priors for the input mass spectrum. In the generative retrieval stage, these candidate molecules are integrated with the input mass spectrum to guide a generative model in producing refined molecular structures, which are then used to re-rank the candidates based on molecular similarity. Experiments on both MassSpecGym and the proposed MassRET-20k dataset demonstrate that GLMR significantly outperforms existing methods, achieving over 40% improvement in top-1 accuracy and exhibiting strong generalizability.
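The two-stage retrieve-then-rerank idea described in the abstract can be sketched in a few lines. This is an illustrative toy (cosine similarity over made-up embeddings), not the authors' GLMR code; `pre_retrieve` and `rerank` are hypothetical names:

```python
import numpy as np

def pre_retrieve(spectrum_emb, library_embs, k=3):
    """Stage 1: pick top-k candidate molecules for a spectrum
    by cosine similarity (stands in for the contrastive model)."""
    sims = library_embs @ spectrum_emb / (
        np.linalg.norm(library_embs, axis=1) * np.linalg.norm(spectrum_emb))
    return np.argsort(-sims)[:k]

def rerank(candidate_ids, candidate_embs, generated_emb):
    """Stage 2: re-rank candidates by similarity to a structure
    produced by the generative model."""
    sims = candidate_embs @ generated_emb
    order = np.argsort(-sims)
    return [candidate_ids[i] for i in order]
```

In the paper the second-stage similarity is molecular (between generated and candidate structures); here a dot product over toy vectors plays that role.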

NeurIPS Conference 2025 Conference Paper

HiMoLE: Towards OOD-Robust LoRA via Hierarchical Mixture of Experts

  • Yinuo Jiang
  • Yan Xiaodong
  • Keyan Ding
  • Deng Zhao
  • Lei Liang
  • Qiang Zhang
  • Huajun Chen

Parameter-efficient fine-tuning (PEFT) methods, such as LoRA, have enabled the efficient adaptation of large language models (LLMs) by updating only a small subset of parameters. However, their robustness under out-of-distribution (OOD) conditions remains insufficiently studied. In this paper, we identify the limitations of conventional LoRA in handling distributional shifts and propose HiMoLE (Hierarchical Mixture of LoRA Experts), a new framework designed to improve OOD generalization. HiMoLE integrates hierarchical expert modules and hierarchical routing strategies into the LoRA architecture and introduces a two-phase training procedure enhanced by a diversity-driven loss. This design mitigates negative transfer and promotes effective knowledge adaptation across diverse data distributions. We evaluate HiMoLE on three representative tasks in natural language processing. Experimental results show that HiMoLE consistently outperforms existing LoRA-based approaches, significantly reducing performance degradation on OOD data while improving in-distribution performance. Our work bridges the gap between parameter efficiency and distributional robustness, advancing the practical deployment of LLMs in real-world applications.
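As a rough illustration of a mixture of LoRA experts (a flat router for simplicity, not the paper's hierarchical routing; all shapes and names here are invented): the forward pass adds a router-weighted sum of low-rank updates to the frozen base weight.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n_experts = 8, 2, 3

W = rng.normal(size=(d, d))               # frozen base weight
A = rng.normal(size=(n_experts, r, d))    # per-expert LoRA down-projections
B = np.zeros((n_experts, d, r))           # per-expert up-projections (zero-init)
gate_W = rng.normal(size=(n_experts, d))  # router

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def mole_forward(x):
    """Base output plus a router-weighted sum of LoRA expert updates."""
    gates = softmax(gate_W @ x)
    delta = sum(g * (B[e] @ (A[e] @ x)) for e, g in enumerate(gates))
    return W @ x + delta
```

With the conventional zero initialization of `B`, the layer starts out identical to the frozen base layer, which is the usual LoRA warm-start property.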

JBHI Journal 2025 Journal Article

MPSol: A Multimodal Prompt Learning Framework for Protein Solubility Prediction

  • Yuhang Zhang
  • Peilin Chen
  • Keyan Ding
  • Han Liu
  • Shiqi Wang
  • Qi Song

Protein solubility is a critical determinant of biologic candidates’ developability, stability, and therapeutic efficacy. However, accurate solubility prediction remains a central challenge in computational protein engineering due to the inherent complexity within protein sequences. In this work, we propose a multimodal prompt learning framework, called MPSol, for protein solubility prediction that integrates complementary representations derived from primary sequences, structural proxies, and textual descriptions generated by large language models (LLMs). MPSol is built upon a unified multimodal backbone with a dedicated cross-modal fusion module that captures fine-grained interactions across modalities. In addition, we design label-aware prompts that encode solubility-specific semantic cues associated with each class. These prompts provide semantic supervision, guiding the alignment of fused protein representations to promote semantic consistency. Extensive experiments demonstrate that MPSol achieves state-of-the-art performance, reaching an accuracy of 0.815, AUC of 0.867 and MCC of 0.642 on the standard PDBSol test set, and generalizes well to the external out-of-distribution test dataset with an accuracy of 0.632, AUC of 0.653 and MCC of 0.332. These results underscore the potential of prompt-driven multimodal learning for interpretable and effective protein property prediction.
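The label-aware prompt idea — scoring a fused protein embedding against per-class prompt embeddings — can be illustrated as follows. Everything here is hypothetical (names, shapes, the cosine-similarity readout), not the MPSol implementation:

```python
import numpy as np

def prompt_classify(fused_emb, label_prompts):
    """Classify by cosine similarity between a fused protein embedding
    and label-aware prompt embeddings (e.g. soluble vs. insoluble)."""
    def norm(v):
        return v / np.linalg.norm(v)
    sims = np.array([norm(fused_emb) @ norm(p) for p in label_prompts])
    return int(np.argmax(sims)), sims
```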

ICLR Conference 2025 Conference Paper

SaMer: A Scenario-aware Multi-dimensional Evaluator for Large Language Models

  • Kehua Feng
  • Keyan Ding
  • Jing Yu
  • Yiwen Qu
  • Zhiwen Chen 0002
  • Chengfei Lv
  • Gang Yu
  • Qiang Zhang 0026

Evaluating the response quality of large language models (LLMs) for open-ended questions poses a significant challenge, especially given the subjectivity and multi-dimensionality of "quality" in natural language generation. Existing LLM evaluators often neglect that different scenarios require distinct evaluation criteria. In this work, we propose SaMer, a scenario-aware multi-dimensional evaluator designed to provide both overall and fine-grained assessments of LLM-generated responses. Unlike fixed-dimension evaluation approaches, SaMer adapts to different scenarios by automatically identifying and prioritizing relevant evaluation dimensions tailored to the given query. To achieve this, we construct a large-scale fine-grained preference dataset spanning multiple real-world scenarios, each with distinct evaluation dimensions. We then leverage a text embedding model combined with three specialized heads to predict the appropriate evaluation dimensions and corresponding scores, as well as the respective weights that contribute to the overall score. The resulting model offers fine-grained and interpretable evaluations and shows robust adaptability across diverse scenarios. Extensive experiments on eight single rating and pairwise comparison datasets demonstrate that SaMer outperforms existing baselines in a variety of evaluation tasks, showcasing its robustness, versatility, and generalizability.
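The three-head design (dimension relevance, per-dimension scores, and mixing weights over a shared embedding) can be sketched as below. The heads are plain linear maps here and every name is illustrative, not the released SaMer model:

```python
import numpy as np

def samer_style_score(emb, dim_head, score_head, weight_head, threshold=0.5):
    """Combine three heads over one text embedding:
    which dimensions apply, their scores, and their mixing weights."""
    relevance = 1 / (1 + np.exp(-(dim_head @ emb)))  # sigmoid per dimension
    scores = score_head @ emb                        # per-dimension scores
    logits = weight_head @ emb                       # raw mixing weights
    mask = relevance > threshold
    if not mask.any():
        mask[:] = True                               # fall back to all dims
    w = np.exp(logits - logits.max())
    w = np.where(mask, w, 0.0)
    w = w / w.sum()                                  # softmax over active dims
    overall = float(w @ scores)
    return mask, scores, overall
```

The overall score is a weighted sum of the per-dimension scores, restricted to the dimensions the relevance head selects for the given query.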

NeurIPS Conference 2024 Conference Paper

DePLM: Denoising Protein Language Models for Property Optimization

  • Zeyuan Wang
  • Keyan Ding
  • Ming Qin
  • Xiaotong Li
  • Xiang Zhuang
  • Yu Zhao
  • Jianhua Yao
  • Qiang Zhang

Protein optimization is a fundamental biological task aimed at enhancing the performance of proteins by modifying their sequences. Computational methods primarily rely on evolutionary information (EI) encoded by protein language models (PLMs) to predict the fitness landscape for optimization. However, these methods suffer from a few limitations. (1) Evolutionary processes involve the simultaneous consideration of multiple functional properties, often overshadowing the specific property of interest. (2) Measurements of these properties tend to be tailored to experimental conditions, leading to reduced generalizability of trained models to novel proteins. To address these limitations, we introduce Denoising Protein Language Models (DePLM), a novel approach that refines the evolutionary information embodied in PLMs for improved protein optimization. Specifically, we conceptualize EI as comprising both property-relevant and irrelevant information, with the latter acting as “noise” for the optimization task at hand. Our approach involves denoising this EI in PLMs through a diffusion process conducted in the rank space of property values, thereby enhancing model generalization and ensuring dataset-agnostic learning. Extensive experimental results have demonstrated that DePLM not only surpasses the state-of-the-art in mutation effect prediction but also exhibits strong generalization capabilities for novel proteins.
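Only the rank-space mapping mentioned in the abstract is simple enough to sketch here; the diffusion process itself is omitted. `to_ranks` is a hypothetical helper showing what "the rank space of property values" means — each measured value is replaced by its rank, which is what makes learning dataset-agnostic:

```python
import numpy as np

def to_ranks(values):
    """Map property values to rank space (0 = lowest value).
    Ranks are invariant to any monotone rescaling of the assay."""
    order = np.argsort(values)
    ranks = np.empty_like(order)
    ranks[order] = np.arange(len(values))
    return ranks
```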

ICML Conference 2024 Conference Paper

Knowledge-aware Reinforced Language Models for Protein Directed Evolution

  • Yuhao Wang
  • Qiang Zhang 0026
  • Ming Qin
  • Xiang Zhuang
  • Xiaotong Li
  • Zhichen Gong
  • Zeyuan Wang
  • Yu Zhao 0009

Directed evolution, a cornerstone of protein optimization, harnesses natural mutational processes to enhance protein functionality. Existing Machine Learning-assisted Directed Evolution (MLDE) methodologies typically rely on data-driven strategies and often overlook the profound domain knowledge in biochemical fields. In this paper, we introduce a novel Knowledge-aware Reinforced Language Model (KnowRLM) for MLDE. An Amino Acid Knowledge Graph (AAKG) is constructed to represent the intricate biochemical relationships among amino acids. We further propose a Protein Language Model (PLM)-based policy network that iteratively samples mutants through preferential random walks on the AAKG using a dynamic sliding window mechanism. The novel mutants are actively sampled to fine-tune a fitness predictor as the reward model, providing feedback to the knowledge-aware policy. Finally, we optimize the whole system in an active learning approach that mimics biological settings in practice. KnowRLM stands out for its ability to utilize contextual amino acid information from knowledge graphs, thus attaining advantages from both statistical patterns of protein sequences and biochemical properties of amino acids. Extensive experiments demonstrate the superior performance of KnowRLM in more efficiently identifying high-fitness mutants compared to existing methods.
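A preferential random walk over a weighted graph — the sampling primitive the abstract describes, minus the PLM policy and the sliding window — might look like this. The graph layout and names are hypothetical, not the paper's AAKG:

```python
import random

def preferential_walk(graph, start, steps, rng=None):
    """Walk a weighted graph, drawing each next node in proportion
    to edge weight (e.g. a biochemical affinity between amino acids)."""
    rng = rng or random.Random(0)
    path = [start]
    node = start
    for _ in range(steps):
        nbrs = graph[node]                 # dict: neighbor -> weight
        r = rng.random() * sum(nbrs.values())
        for nxt, w in nbrs.items():        # roulette-wheel selection
            r -= w
            if r <= 0:
                node = nxt
                break
        path.append(node)
    return path
```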

ECAI Conference 2023 Conference Paper

Active Finetuning Protein Language Model: A Budget-Friendly Method for Directed Evolution

  • Ming Qin
  • Keyan Ding
  • Bin Wu 0025
  • Zhenping Li
  • Haihong Yang
  • Zeyuan Wang
  • Hongbin Ye
  • Haoran Yu

Directed evolution is a widely used strategy of protein engineering to improve protein function by mimicking natural mutation and selection. Machine learning-assisted directed evolution (MLDE) approaches aim to learn a fitness predictor, thereby efficiently searching for optimal mutants within the vast combinatorial mutation space. Since annotating mutants is both costly and labor-intensive, how to efficiently sample and utilize informative protein mutants to train the predictor is a critical problem in MLDE. Previous MLDE works simply utilized pre-trained protein language models (PPLMs) for sampling without tailoring them to the specific target protein of interest, which has not fully exploited the potential of PPLMs. In this work, we propose a novel method, the Actively-Finetuned Protein language model for Directed Evolution (AFP-DE), which leverages PPLMs to actively sample and fine-tune themselves, continuously improving the model’s sampling and overall performance through iterations, to achieve efficient directed protein evolution. Extensive experiments have shown the effectiveness of our method in generating optimal mutants with minimal annotation effort, outperforming previous works even with fewer annotated mutants, making it budget-friendly for biological experiments.
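One round of the iterative sample-annotate-finetune loop can be sketched with a greedy acquisition over model scores. This is a simplification (the paper's sampling is PPLM-driven), and every name here is hypothetical:

```python
import numpy as np

def active_learning_round(pool_scores, labeled_idx, budget):
    """Pick the top-`budget` not-yet-annotated mutants by model score;
    in practice these would then be annotated and used for fine-tuning."""
    mask = np.ones(len(pool_scores), dtype=bool)
    mask[list(labeled_idx)] = False        # exclude already-annotated mutants
    candidates = np.where(mask)[0]
    order = candidates[np.argsort(-pool_scores[candidates])]
    return list(order[:budget])
```

The annotation budget is the fixed cost per iteration, which is what makes the overall procedure "budget-friendly".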

IJCAI Conference 2023 Conference Paper

Graph Sampling-based Meta-Learning for Molecular Property Prediction

  • Xiang Zhuang
  • Qiang Zhang
  • Bin Wu
  • Keyan Ding
  • Yin Fang
  • Huajun Chen

Molecular properties are usually observed with a limited number of samples, and researchers have considered property prediction as a few-shot problem. One important fact that has been ignored by prior works is that each molecule can be recorded with several different properties simultaneously. To effectively utilize many-to-many correlations of molecules and properties, we propose a Graph Sampling-based Meta-learning (GS-Meta) framework for few-shot molecular property prediction. First, we construct a Molecule-Property relation Graph (MPG): molecules and properties are nodes, while property labels decide edges. Then, to utilize the topological information of MPG, we reformulate an episode in meta-learning as a subgraph of the MPG, containing a target property node, molecule nodes, and auxiliary property nodes. Third, as episodes in the form of subgraphs are no longer independent of each other, we propose to schedule the subgraph sampling process with a contrastive loss function, which considers the consistency and discrimination of subgraphs. Extensive experiments on 5 commonly-used benchmarks show GS-Meta consistently outperforms state-of-the-art methods by 5.71%-6.93% in ROC-AUC and verify the effectiveness of each proposed module. Our code is available at https://github.com/HICAI-ZJU/GS-Meta.
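An episode-as-subgraph sampler can be sketched from a molecule-by-property label matrix (NaN marks an unrecorded label). This is an illustrative reading of the abstract, not the released GS-Meta code, and all names are hypothetical:

```python
import numpy as np

def sample_episode(labels, target_prop, n_aux, rng):
    """Build one episode subgraph: the target property node, the
    molecule nodes labeled for it, and n_aux auxiliary property nodes."""
    molecules = np.where(~np.isnan(labels[:, target_prop]))[0]
    other = [p for p in range(labels.shape[1]) if p != target_prop]
    aux = rng.choice(other, size=n_aux, replace=False)
    return {"target": target_prop,
            "molecules": molecules.tolist(),
            "aux_props": sorted(aux.tolist())}
```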

NeurIPS Conference 2023 Conference Paper

Learning Invariant Molecular Representation in Latent Discrete Space

  • Xiang Zhuang
  • Qiang Zhang
  • Keyan Ding
  • Yatao Bian
  • Xiao Wang
  • Jingsong Lv
  • Hongyang Chen
  • Huajun Chen

Molecular representation learning lays the foundation for drug discovery. However, existing methods suffer from poor out-of-distribution (OOD) generalization, particularly when data for training and testing originate from different environments. To address this issue, we propose a new framework for learning molecular representations that exhibit invariance and robustness against distribution shifts. Specifically, we propose a strategy called "first-encoding-then-separation" to identify invariant molecule features in the latent space, which deviates from conventional practices. Prior to the separation step, we introduce a residual vector quantization module that mitigates the over-fitting to training data distributions while preserving the expressivity of encoders. Furthermore, we design a task-agnostic self-supervised learning objective to encourage precise invariance identification, which makes our method widely applicable to a variety of tasks, such as regression and multi-label classification. Extensive experiments on 18 real-world molecular datasets demonstrate that our model achieves stronger generalization against state-of-the-art baselines in the presence of various distribution shifts. Our code is available at https://github.com/HICAI-ZJU/iMoLD.
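Residual vector quantization itself is a standard technique: each stage quantizes what the previous stages left over, and the selected codes are summed. A minimal sketch, independent of the iMoLD codebase:

```python
import numpy as np

def residual_vq(z, codebooks):
    """Quantize z with a cascade of codebooks: at each stage, pick the
    nearest code to the current residual and add it to the reconstruction."""
    quantized = np.zeros_like(z)
    residual = z.copy()
    codes = []
    for cb in codebooks:                       # cb: (n_codes, dim)
        idx = int(np.argmin(((cb - residual) ** 2).sum(axis=1)))
        quantized += cb[idx]
        residual = z - quantized               # what later stages must explain
        codes.append(idx)
    return quantized, codes
```

Because later stages only model the leftover error, a short cascade of small codebooks can approximate vectors far more finely than a single codebook of the same total size.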