Arrow Research search

Author name cluster

Xichen Ye

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

6 papers
2 author rows

Possible papers

6

AAAI Conference 2026 Conference Paper

Investigating Data Pruning for Pretraining Biological Foundation Models at Scale

  • Yifan Wu
  • Jiyue Jiang
  • Xichen Ye
  • Yiqi Wang
  • Chang Zhou
  • Yitao Xu
  • Jiayang Chen
  • He Hu

Biological foundation models (BioFMs), pretrained on large-scale biological sequences, have recently shown strong potential in providing meaningful representations for diverse downstream bioinformatics tasks. However, such models often rely on millions to billions of training sequences and billions of parameters, resulting in prohibitive computational costs and significant barriers to reproducibility and accessibility, particularly for academic labs. To address these challenges, we investigate the feasibility of data pruning for BioFM pretraining and propose a post-hoc influence-guided data pruning framework tailored to biological domains. Our approach first introduces a subset-based self-influence formulation that enables efficient estimation of sample importance at low computational cost. Building on this, we propose two simple yet effective selection strategies: Top-k Influence (Top I) and Coverage-Centric Influence (CCI). We then empirically validate our method on two representative BioFMs: RNA-FM and ESM-C. For RNA, our framework consistently outperforms random selection baselines under an extreme pruning rate of over 99%, demonstrating its effectiveness. Furthermore, we demonstrate the generalizability of our framework on protein-related tasks using ESM-C. Specifically, our coreset even outperforms random subsets ten times its size in both RNA and protein settings, revealing substantial redundancy in biological sequence datasets. These findings underscore the potential of influence-guided data pruning to substantially reduce the computational cost of BioFM pretraining, paving the way for more efficient, accessible, and sustainable biological AI research.
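The Top-k Influence selection step described in the abstract can be sketched as follows. This is an illustration only, assuming the per-sample self-influence scores have already been computed (the paper's subset-based estimator is not reproduced here); all names are hypothetical:

```python
import numpy as np

def topk_influence_coreset(self_influence, k):
    """Keep the k samples with the largest self-influence scores
    (a Top-k Influence style selection)."""
    order = np.argsort(self_influence)[::-1]  # sort indices descending by score
    return order[:k]
```

Coverage-Centric Influence would additionally balance the selection across regions of the data distribution; that step is omitted here.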

AAAI Conference 2026 Conference Paper

When Human Preferences Flip: An Instance-Dependent Robust Loss for RLHF

  • Yifan Xu
  • Xichen Ye
  • Yifan Chen
  • Qiaosheng Zhang

The quality of datasets plays an important role in large language model (LLM) alignment. In collecting human feedback, however, preference flipping is ubiquitous and corrupts data annotation; this issue calls for alignment algorithms with improved robustness against potentially flipped pairs. To this end, this paper introduces a Flipping-Aware Direct Preference Optimization (FA-DPO) algorithm tailored to preference flipping from a reinforcement learning with human feedback (RLHF) perspective. We dissect the inherent human intention model and the preference-flipping mechanism introduced by external factors as two distinct stages; in the latter, we introduce an instance-dependent flipping probability on the basis of the Bradley-Terry (BT) model. Further, by leveraging features relevant to preference annotation, we capture uncertainty in judgments and model preference-flipping patterns. In practice, we design a simple yet efficient iterative optimization algorithm compatible with the original RLHF and Direct Preference Optimization (DPO) algorithms. In our experiments, we evaluate our proposed method and baseline methods under multiple instance-dependent preference-flipping scenarios.
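A minimal sketch of the flipping-aware likelihood idea: under a Bradley-Terry model, the probability of the observed preference becomes a mixture of the unflipped and flipped cases, weighted by an instance-dependent flipping probability. This is an illustration, not the paper's implementation; `margin` stands in for the implicit reward margin between the chosen and rejected responses:

```python
import math

def flip_aware_nll(margin, flip_prob):
    """Negative log-likelihood of one preference pair under a
    Bradley-Terry model with an instance-dependent flip probability."""
    sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
    # the observed label is correct with prob (1 - flip_prob), flipped otherwise
    p_observed = (1.0 - flip_prob) * sigmoid(margin) + flip_prob * sigmoid(-margin)
    return -math.log(p_observed)
```

With `flip_prob = 0` this reduces to the standard per-pair DPO objective, so the flip probability acts as a per-instance robustness knob.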

AAAI Conference 2025 Conference Paper

Optimized Gradient Clipping for Noisy Label Learning

  • Xichen Ye
  • Yifan Wu
  • Weizhong Zhang
  • Xiaoqiang Li
  • Yifan Chen
  • Cheng Jin

Previous research has shown that constraining the gradient of the loss function w.r.t. model-predicted probabilities can enhance model robustness against noisy labels. These methods typically specify a fixed optimal clipping threshold via validation data to obtain the desired robustness to noise. However, this common practice overlooks the dynamic distribution of gradients from both clean and noisily labeled samples at different stages of training, significantly limiting the model's capability to adapt to the variable nature of gradients throughout the training process. To address this issue, we propose a simple yet effective approach called Optimized Gradient Clipping (OGC), which dynamically adjusts the clipping threshold based on the ratio of noise gradients to clean gradients after clipping, estimated by modeling the distributions of clean and noisy samples. This allows us to modify the clipping threshold at each training step, effectively controlling the influence of noise gradients. Additionally, we provide a statistical analysis to certify the noise tolerance of OGC. Our extensive experiments across various types of label noise, including symmetric, asymmetric, instance-dependent, and real-world noise, demonstrate the effectiveness of our approach.
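To illustrate the threshold-selection idea (not the paper's estimator), suppose the clean and noisy gradient magnitudes were known; one could then pick the largest candidate threshold whose post-clipping noise-to-clean gradient-mass ratio stays below a target. All names here are hypothetical, and in practice the two distributions would be estimated, not given:

```python
import numpy as np

def clip_gradients(grads, threshold):
    """Element-wise magnitude clipping."""
    return np.clip(grads, -threshold, threshold)

def choose_threshold(clean_grads, noisy_grads, target_ratio, candidates):
    """Return the largest candidate threshold whose post-clipping
    noise-to-clean gradient-mass ratio is at most target_ratio.
    Falls back to the smallest candidate if none qualifies."""
    best = min(candidates)
    for t in sorted(candidates):  # ascending, so the last qualifying t wins
        noise_mass = np.abs(clip_gradients(noisy_grads, t)).sum()
        clean_mass = np.abs(clip_gradients(clean_grads, t)).sum()
        if noise_mass / clean_mass <= target_ratio:
            best = t
    return best
```

The dynamic aspect of OGC would correspond to re-running such a selection at each training step as the gradient distributions shift.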

ICML Conference 2025 Conference Paper

Towards Robust Influence Functions with Flat Validation Minima

  • Xichen Ye
  • Yifan Wu 0011
  • Weizhong Zhang
  • Cheng Jin 0001
  • Yifan Chen 0004

The Influence Function (IF) is a widely used technique for assessing the impact of individual training samples on model predictions. However, existing IF methods often fail to provide reliable influence estimates in deep neural networks, particularly when applied to noisy training data. This issue does not stem from inaccuracies in parameter-change estimation, which has been the primary focus of prior research, but rather from deficiencies in loss-change estimation, specifically due to the sharpness of the validation risk. In this work, we establish a theoretical connection between influence estimation error, validation set risk, and its sharpness, underscoring the importance of flat validation minima for accurate influence estimation. Furthermore, we introduce a novel estimation form of the Influence Function specifically designed for flat validation minima. Experimental results across various tasks validate the superiority of our approach.
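For reference, the classic influence-function estimate that this line of work builds on scores a training point by -grad_val^T H^{-1} grad_train. A toy numpy version with explicit small matrices (real implementations approximate the inverse-Hessian-vector product rather than solving directly):

```python
import numpy as np

def influence_estimate(grad_val, hessian, grad_train):
    """Classic influence of upweighting one training sample on the
    validation loss: -grad_val^T H^{-1} grad_train."""
    return -grad_val @ np.linalg.solve(hessian, grad_train)
```

As the abstract notes, the error the paper targets sits in the validation-side term: a sharp validation minimum makes the loss-change estimate unreliable, which motivates the flat-minima estimation form.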

NeurIPS Conference 2023 Conference Paper

Active Negative Loss Functions for Learning with Noisy Labels

  • Xichen Ye
  • Xiaoqiang Li
  • Songmin Dai
  • Tong Liu
  • Yan Sun
  • Weiqin Tong

Robust loss functions are essential for training deep neural networks in the presence of noisy labels. Some robust loss functions use Mean Absolute Error (MAE) as a necessary component; for example, the recently proposed Active Passive Loss (APL) uses MAE as its passive loss function. However, MAE treats every sample equally, which slows down convergence and can make training difficult. In this work, we propose a new class of theoretically robust passive loss functions different from MAE, namely Normalized Negative Loss Functions (NNLFs), which focus more on memorized clean samples. By replacing the MAE in APL with our proposed NNLFs, we improve APL and propose a new framework called Active Negative Loss (ANL). Experimental results on benchmark and real-world datasets demonstrate that the new set of loss functions created by our ANL framework can outperform state-of-the-art methods. The code is available at https://github.com/Virusdoll/Active-Negative-Loss.
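The normalization scheme used throughout this line of work divides a base loss at the observed label by its sum over all candidate labels, which bounds the loss and makes it more noise-robust. The sketch below illustrates that general scheme, not the paper's specific NNLFs:

```python
import numpy as np

def normalized_loss(probs, label, base_loss):
    """Divide the base loss at the given label by its sum over all
    labels, yielding a bounded normalized loss."""
    losses = np.array([base_loss(probs, j) for j in range(len(probs))])
    return losses[label] / losses.sum()

# cross entropy as an example base loss (clamped for numerical safety)
cross_entropy = lambda p, j: -np.log(max(p[j], 1e-12))
```

A confident, correct prediction then yields a normalized loss near 0, while a confidently wrong one approaches 1.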

AAAI Conference 2023 Conference Paper

GradPU: Positive-Unlabeled Learning via Gradient Penalty and Positive Upweighting

  • Songmin Dai
  • Xiaoqiang Li
  • Yue Zhou
  • Xichen Ye
  • Tong Liu

Positive-unlabeled learning is an essential problem in many real-world applications where only labeled positive and unlabeled data are available, especially when negative samples are difficult to identify. Most existing positive-unlabeled learning methods will inevitably overfit the positive class to some extent due to unidentified positive samples. This paper first analyzes the overfitting problem and proposes to bound the generalization errors via Wasserstein distances. Based on that, we develop a simple yet effective positive-unlabeled learning method, GradPU, which consists of two key ingredients: a gradient-based regularizer that penalizes gradient norms in the interpolated data region, improving the generalization of the positive class; and an unnormalized upweighting mechanism that assigns larger weights to positive samples that are hard, poorly fitted, and less frequently labeled, enforcing a small training error on each positive sample and increasing robustness to labeling bias. We evaluate GradPU on three datasets: MNIST, FashionMNIST, and CIFAR10. The results demonstrate that GradPU achieves state-of-the-art performance in both unbiased and biased positive-labeling scenarios.
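The gradient-based regularizer can be sketched as penalizing the squared gradient norm of the model at points interpolated between positive and unlabeled samples. The sketch below assumes `f_grad(x)` returns the model's input-gradient at x; all names are illustrative and this is only one plausible reading of the regularizer:

```python
import numpy as np

def interpolated_gradient_penalty(f_grad, x_pos, x_unl, rng):
    """Mean squared gradient norm at random convex combinations of
    positive and unlabeled samples (one interpolation per row)."""
    alpha = rng.uniform(size=(len(x_pos), 1))
    x_interp = alpha * x_pos + (1.0 - alpha) * x_unl
    grads = np.array([f_grad(x) for x in x_interp])
    return np.mean(np.sum(grads ** 2, axis=1))
```

For a linear model the input-gradient is constant, so the penalty equals the squared weight norm regardless of the interpolation points.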