
Author name cluster

Xiaojie Li

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

5 papers
2 author rows

Possible papers (5)

NeurIPS 2025 Conference Paper

Adaptive Batch-Wise Sample Scheduling for Direct Preference Optimization

  • Zixuan Huang
  • Yikun Ban
  • Lean Fu
  • Xiaojie Li
  • Zhongxiang Dai
  • Jianxin Li
  • Deqing Wang

Direct Preference Optimization (DPO) has emerged as an effective approach for aligning large language models (LLMs) with human preferences. However, its performance is highly dependent on the quality of the underlying human preference data. To address this bottleneck, prior work has explored various data selection strategies, but these methods often overlook the impact of the evolving states of the language model during the optimization process. In this paper, we introduce a novel problem: Sample Scheduling for DPO, which aims to dynamically and adaptively schedule training samples based on the model's evolving batch-wise states throughout preference optimization. To solve this problem, we propose SamS, an efficient and effective algorithm that adaptively selects samples in each training batch based on the LLM's learning feedback to maximize the potential generalization performance. Notably, without modifying the core DPO algorithm, simply integrating SamS significantly improves performance across tasks, with minimal additional computational overhead. This work points to a promising new direction for improving LLM alignment through batch-wise sample selection, with potential generalization to RLHF and broader supervised learning paradigms.
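To make the scheduling idea concrete, below is a minimal sketch of selecting samples inside a DPO training batch. The abstract does not disclose SamS's actual scoring rule, so this sketch scores samples by the standard DPO implicit reward margin and keeps the hardest ones; `dpo_margins`, `schedule_batch`, and the keep ratio are illustrative assumptions, not the paper's algorithm.

```python
# Illustrative sketch of batch-wise sample scheduling around a DPO step.
# SamS's actual scoring rule is not given in the abstract; here samples are
# scored by the DPO implicit reward margin and the hardest ones are kept.
import torch
import torch.nn.functional as F

def dpo_margins(pi_logps_w, pi_logps_l, ref_logps_w, ref_logps_l, beta=0.1):
    """Per-sample implicit reward margin: beta * [(pi_w - ref_w) - (pi_l - ref_l)]."""
    return beta * ((pi_logps_w - ref_logps_w) - (pi_logps_l - ref_logps_l))

def schedule_batch(margins, keep_ratio=0.5):
    """Hypothetical feedback-driven selection: keep the least-learned
    (smallest-margin) samples in the batch."""
    k = max(1, int(keep_ratio * margins.numel()))
    return torch.topk(-margins, k).indices

def scheduled_dpo_loss(margins, idx):
    """Standard DPO loss, -log(sigmoid(margin)), over the scheduled subset only."""
    return -F.logsigmoid(margins[idx]).mean()
```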

NeurIPS 2024 Conference Paper

CorDA: Context-Oriented Decomposition Adaptation of Large Language Models for Task-Aware Parameter-Efficient Fine-tuning

  • Yibo Yang
  • Xiaojie Li
  • Zhongzhu Zhou
  • Shuaiwen L. Song
  • Jianlong Wu
  • Liqiang Nie
  • Bernard Ghanem

Current parameter-efficient fine-tuning (PEFT) methods build adapters that are largely agnostic of the context of the downstream task to be learned, or of the important knowledge to be maintained. As a result, there is often a performance gap compared to full-parameter fine-tuning, and the fine-tuned model also suffers from catastrophic forgetting of pre-trained world knowledge. In this paper, we propose **CorDA**, a Context-oriented Decomposition Adaptation method that builds learnable **task-aware adapters** from a weight decomposition oriented by the context of the downstream task or of the world knowledge to maintain. Concretely, we collect a few data samples and perform singular value decomposition on each linear layer's weight of a pre-trained LLM, multiplied by the covariance matrix of the input activations computed from these samples. The decomposed components are multiplied by the inverse of the covariance matrix to reconstruct the original weights. In this way, the context of the representative samples determines the orientation of the factorization. Our method enables two options: **knowledge-preserved adaptation** and **instruction-previewed adaptation**. For the former, we use question-answering samples to obtain the covariance matrices, initialize a learnable adapter from the decomposed components with the smallest $r$ singular values, and freeze the others so that world knowledge is better preserved. For the latter, we use the instruction data from the fine-tuning task, such as math or coding, to orient the decomposition and train the largest $r$ components, which correspond most closely to the task to be learned. We conduct extensive experiments on math, code, and instruction-following tasks. Our knowledge-preserved adaptation not only achieves better performance than LoRA on fine-tuning tasks, but also mitigates the forgetting of world knowledge. Our instruction-previewed adaptation further improves fine-tuning performance to be comparable with full fine-tuning, surpassing state-of-the-art PEFT methods such as LoRA, DoRA, and PiSSA.
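The decomposition step is concrete enough to sketch. The snippet below follows the abstract for a single linear layer: form the input-activation covariance, take the SVD of the weight multiplied by it, and right-multiply by the covariance inverse so that the components sum back to the original weight, splitting them into a learnable adapter and a frozen part. The dense (unfactored) adapter, the ridge term `eps`, and the function names are illustrative assumptions rather than the paper's exact recipe.

```python
# Illustrative sketch of context-oriented decomposition for one linear
# layer, following the abstract: SVD of W @ C with C the input-activation
# covariance, then right-multiplication by C^{-1} so the parts sum back to W.
# The dense adapter/frozen split and the ridge term `eps` are assumptions;
# in practice the adapter would be kept in low-rank factored form.
import torch

def context_oriented_decompose(W, X, r, mode="knowledge_preserved", eps=1e-6):
    # W: (out, in) pre-trained weight; X: (n_samples, in) input activations
    # collected from a few representative samples (QA or task instructions).
    C = X.T @ X / X.shape[0] + eps * torch.eye(X.shape[1])    # covariance
    U, S, Vt = torch.linalg.svd(W @ C, full_matrices=False)   # oriented SVD
    C_inv = torch.linalg.inv(C)
    if mode == "knowledge_preserved":     # train smallest-r, freeze the rest
        adapter = (U[:, -r:] * S[-r:]) @ Vt[-r:] @ C_inv
        frozen = (U[:, :-r] * S[:-r]) @ Vt[:-r] @ C_inv
    else:                                 # "instruction_previewed": train largest-r
        adapter = (U[:, :r] * S[:r]) @ Vt[:r] @ C_inv
        frozen = (U[:, r:] * S[r:]) @ Vt[r:] @ C_inv
    return adapter, frozen                # adapter + frozen == W (up to eps)
```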

ICML 2024 Conference Paper

Towards Interpretable Deep Local Learning with Successive Gradient Reconciliation

  • Yibo Yang
  • Xiaojie Li
  • Motasem Alfarra
  • Hasan Abed Al Kader Hammoud
  • Adel Bibi
  • Philip H. S. Torr
  • Bernard Ghanem

Relieving the reliance of neural network training on global back-propagation (BP) has emerged as a notable research topic, owing to the biological implausibility and the huge memory consumption of BP. Among existing solutions, local learning optimizes gradient-isolated modules of a neural network with local errors and has proven effective even on large-scale datasets. However, the reconciliation among local errors has never been investigated. In this paper, we first theoretically study non-greedy layer-wise training and show that convergence cannot be assured when the local gradient of a module w.r.t. its input is not reconciled with the local gradient of the previous module w.r.t. its output. Inspired by this theoretical result, we further propose a local training strategy that successively regularizes the gradient reconciliation between neighboring modules, without breaking gradient isolation or introducing any learnable parameters. Our method can be integrated into both local-BP and BP-free settings. In experiments, we achieve significant performance improvements over previous methods. In particular, our method for CNN and Transformer architectures on ImageNet attains performance competitive with global BP while saving more than 40% of memory consumption.
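The boundary condition the paper identifies can be sketched for two gradient-isolated modules: compare the gradient of module 1's local loss w.r.t. its output with the gradient of module 2's local loss w.r.t. its (detached) input, and penalize their mismatch. The L2 penalty, the auxiliary head, and all names below are illustrative stand-ins, not the paper's exact regularizer.

```python
# Illustrative sketch: reconcile local gradients at the boundary between
# two gradient-isolated modules. The L2 mismatch penalty is a stand-in
# for the paper's regularizer, not its exact form.
import torch
import torch.nn as nn

f1, f2 = nn.Linear(32, 32), nn.Linear(32, 10)  # two gradient-isolated modules
head1 = nn.Linear(32, 10)                      # auxiliary local classifier (hypothetical)
criterion = nn.CrossEntropyLoss()

def local_step(x, y, lam=0.1):
    h1 = f1(x)
    loss1 = criterion(head1(h1), y)            # local error of module 1
    # gradient of module 1's local loss w.r.t. its output ...
    g_out = torch.autograd.grad(loss1, h1, create_graph=True)[0]

    h1_in = h1.detach().requires_grad_(True)   # gradient-isolation boundary
    loss2 = criterion(f2(h1_in), y)            # local error of module 2
    # ... reconciled with module 2's local gradient w.r.t. its input
    g_in = torch.autograd.grad(loss2, h1_in, create_graph=True)[0]

    reconcile = (g_in - g_out).pow(2).sum()    # penalize the mismatch
    return loss1 + loss2 + lam * reconcile

loss = local_step(torch.randn(8, 32), torch.randint(0, 10, (8,)))
loss.backward()
```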

NeurIPS 2020 Conference Paper

Agree to Disagree: Adaptive Ensemble Knowledge Distillation in Gradient Space

  • Shangchen Du
  • Shan You
  • Xiaojie Li
  • Jianlong Wu
  • Fei Wang
  • Chen Qian
  • Changshui Zhang

Distilling knowledge from an ensemble of teacher models is expected to yield better performance than distilling from a single one. Current methods mainly adopt a vanilla averaging rule, i.e., simply taking the average of all teacher losses to train the student network. However, this approach treats the teachers equally and ignores the diversity among them. When conflicts or competitions exist among teachers, which is common, the resulting compromise can hurt distillation performance. In this paper, we examine the diversity of teacher models in the gradient space and cast ensemble knowledge distillation as a multi-objective optimization problem, so that we can determine a better optimization direction for training the student network. In addition, we introduce a tolerance parameter to accommodate disagreement among teachers. In this way, our method can be seen as dynamically weighting each teacher in the ensemble. Extensive experiments validate the effectiveness of our method in both logits-based and feature-based cases.
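The gradient-space view can be sketched for the two-teacher case, where the min-norm convex combination of the per-teacher KD gradients has a closed form (the classic two-task MGDA solution). The abstract does not detail the paper's actual solver or how its tolerance parameter enters, so the sketch below is an illustrative stand-in under those assumptions.

```python
# Illustrative sketch: ensemble KD as multi-objective optimization in
# gradient space, two-teacher case. The min-norm combination below is the
# classic MGDA two-task closed form, a stand-in for the paper's solver;
# the tolerance parameter is omitted.
import torch

def min_norm_alpha(g1, g2):
    """alpha in [0, 1] minimizing ||alpha*g1 + (1 - alpha)*g2||^2."""
    diff = g2 - g1
    return (torch.dot(g2, diff) / diff.dot(diff).clamp_min(1e-12)).clamp(0.0, 1.0)

def combined_kd_gradient(kd_losses, params):
    """kd_losses: one distillation loss per teacher, computed on the same batch."""
    flats = []
    for loss in kd_losses:
        grads = torch.autograd.grad(loss, params, retain_graph=True)
        flats.append(torch.cat([g.reshape(-1) for g in grads]))
    alpha = min_norm_alpha(flats[0], flats[1])  # dynamic per-teacher weighting
    # writing this direction back into the student's parameters is omitted
    return alpha * flats[0] + (1 - alpha) * flats[1]
```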

IJCAI 2013 Conference Paper

Manifold Alignment Based on Sparse Local Structures of More Corresponding Pairs

  • Xiaojie Li
  • Jian Cheng Lv
  • Zhang Yi

Manifold alignment aims to extract the shared latent semantic structure from multiple manifolds. The joint adjacency matrix plays a key role in manifold alignment, and to construct it, it is crucial to obtain more corresponding pairs. This paper proposes an approach for obtaining more, and more reliable, corresponding pairs based on local structure correspondence. The sparse reconstruction weight matrix of each manifold is established to preserve the local geometry of the original data set. Sparse correspondence matrices are then constructed using the sparse local structures of corresponding pairs across manifolds. Furthermore, a new energy function for manifold alignment is proposed to simultaneously match the corresponding instances and preserve the local geometry of each manifold. The shared low-dimensional embedding, which better describes the intrinsic geometry of, and relations between, the different manifolds, can be obtained by solving the optimization problem in closed form. Experiments demonstrate the effectiveness of the proposed algorithm.
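A minimal sketch of the pipeline the abstract describes, under simplifying assumptions: sparse reconstruction weights per manifold via an l1-penalized regression, a joint adjacency matrix augmented with given cross-manifold pairs (the paper's correspondence-discovery step is elided), and a Laplacian-eigenmaps-style closed-form embedding standing in for the paper's energy function. All names and the penalty weight `mu` are illustrative.

```python
# Illustrative sketch: sparse local reconstruction weights, a joint
# adjacency with cross-manifold correspondences, and a closed-form
# spectral embedding. Simplified relative to the paper's energy function.
import numpy as np
from sklearn.linear_model import Lasso

def sparse_weights(X, alpha=0.01):
    """Reconstruct each point from the others with an l1 penalty (LLE-like)."""
    n = X.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        others = np.delete(np.arange(n), i)
        coef = Lasso(alpha=alpha, fit_intercept=False).fit(X[others].T, X[i]).coef_
        W[i, others] = coef
    return W

def aligned_embedding(X1, X2, pairs, d=2, mu=1.0):
    """pairs: known corresponding index pairs (i in X1, j in X2)."""
    n1 = len(X1)
    W = np.zeros((n1 + len(X2), n1 + len(X2)))
    W[:n1, :n1] = sparse_weights(X1)           # local geometry of manifold 1
    W[n1:, n1:] = sparse_weights(X2)           # local geometry of manifold 2
    for i, j in pairs:                         # cross-manifold correspondences
        W[i, n1 + j] = W[n1 + j, i] = mu
    S = (np.abs(W) + np.abs(W.T)) / 2          # symmetric joint adjacency
    L = np.diag(S.sum(axis=1)) - S             # graph Laplacian
    _, vecs = np.linalg.eigh(L)
    return vecs[:, 1:d + 1]                    # skip the trivial eigenvector
```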