Arrow Research

Author name cluster

Changhai Zhou

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

2 papers
2 author rows

Possible papers (2)

ICML 2025 · Conference Paper

BSLoRA: Enhancing the Parameter Efficiency of LoRA with Intra-Layer and Inter-Layer Sharing

  • Yuhua Zhou
  • Ruifeng Li
  • Changhai Zhou
  • Fei Yang
  • Aimin Pan

Low-Rank Adaptation (LoRA) is a widely adopted parameter-efficient fine-tuning method for large language models (LLMs) to adapt to downstream tasks. However, in scenarios where multiple LoRA models are deployed simultaneously, standard LoRA introduces substantial trainable parameters, resulting in significant memory overhead and inference latency, particularly when supporting thousands of downstream tasks on a single server. While existing methods reduce stored parameters via parameter sharing, they fail to capture both local and global information simultaneously. To address this issue, we propose the Bi-Share LoRA (BSLoRA), which extends local LoRA with intra-LoRA and inter-LoRA parameter sharing to better capture local and global information. This approach reduces trainable parameters while maintaining or even enhancing model performance. Additionally, we design three transformation methods to improve the compatibility and collaborative efficiency of shared parameters with varying shapes, enhancing overall adaptability. Experiments on the 7B, 8B, and 13B versions of Llama show that BSLoRA, with only 44.59% of the parameters of standard LoRA, outperforms LoRA by approximately 0.33% on commonsense reasoning and 2.08% on MMLU benchmarks. Code is available at https://github.com/yuhua-zhou/BSLoRA.git.
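
As a rough sketch of the sharing idea the abstract describes (not the authors' released BSLoRA code; the module name SharedLoRALinear, the per-projection scaling vector, and the exact sharing scheme are assumptions for illustration), a single low-rank pair can be reused across several projections, and reusing the same pair across layers would be the inter-layer counterpart:

```python
import torch
import torch.nn as nn

# Hypothetical sketch: standard LoRA gives every linear projection its own
# A (down) and B (up) matrices; here one shared pair is reused, with a small
# private scaling vector per projection, so most LoRA parameters are stored once.

class SharedLoRALinear(nn.Module):
    """Frozen base linear layer plus a LoRA update built from shared factors."""

    def __init__(self, base: nn.Linear, shared_A: nn.Parameter,
                 shared_B: nn.Parameter, rank: int, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # base weights stay frozen
        self.shared_A = shared_A             # (rank, in_features), shared
        self.shared_B = shared_B             # (out_features, rank), shared
        self.scale = nn.Parameter(torch.ones(rank))  # small private parameters
        self.alpha = alpha
        self.rank = rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # ((x @ A^T) * scale) @ B^T gives the low-rank update for this projection.
        lora = (x @ self.shared_A.T) * self.scale @ self.shared_B.T
        return self.base(x) + (self.alpha / self.rank) * lora


# Usage: one shared (A, B) pair serves both the q and v projections of a layer
# (intra-layer sharing); passing the same pair to other layers' projections
# would share it across layers as well.
d, rank = 512, 8
shared_A = nn.Parameter(torch.randn(rank, d) * 0.01)
shared_B = nn.Parameter(torch.zeros(d, rank))
q_proj = SharedLoRALinear(nn.Linear(d, d), shared_A, shared_B, rank)
v_proj = SharedLoRALinear(nn.Linear(d, d), shared_A, shared_B, rank)
```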

AAAI 2025 · Conference Paper

Dynamic Operator Optimization for Efficient Multi-Tenant LoRA Model Serving

  • Changhai Zhou
  • Yuhua Zhou
  • Shiyang Zhang
  • Yibin Wang
  • Zekai Liu

Low-Rank Adaptation (LoRA) has become increasingly popular for efficiently fine-tuning large language models (LLMs) with minimal resources. However, traditional methods that serve multiple LoRA models independently result in redundant computation and low GPU utilization. This paper addresses these inefficiencies by introducing Dynamic Operator Optimization (Dop), an advanced automated optimization technique designed to dynamically optimize the Segmented Gather Matrix-Vector Multiplication (SGMV) operator based on specific scenarios. SGMV's unique design enables batching GPU operations for different LoRA models, significantly improving computational efficiency. The Dop approach leverages a Search Space Constructor to create a hierarchical search space, dividing the program space into high-level structural sketches and low-level implementation details, ensuring diversity and flexibility in operator implementation. Furthermore, an Optimization Engine refines these implementations using evolutionary search, guided by a cost model that estimates program performance. This iterative optimization process ensures that SGMV implementations can dynamically adapt to different scenarios to maintain high performance. We demonstrate that Dop improves throughput by 1.30-1.46 times over a state-of-the-art multi-tenant LoRA serving system.
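
For readers unfamiliar with the operator being optimized, the following is a naive reference sketch of what an SGMV-style computation does (assumed tensor layouts; the function name sgmv_reference is hypothetical, and this loop is not the batched GPU kernel that Dop searches for):

```python
import torch

# Hypothetical reference: each request in a batch is routed to its own LoRA
# adapter, and the per-adapter low-rank updates are applied segment by segment.
# An optimized SGMV kernel fuses this gather + per-segment matmul on the GPU.

def sgmv_reference(x: torch.Tensor,           # (batch, d_in) hidden states
                   lora_A: torch.Tensor,      # (n_adapters, rank, d_in)
                   lora_B: torch.Tensor,      # (n_adapters, d_out, rank)
                   adapter_ids: torch.Tensor  # (batch,) adapter index per row
                   ) -> torch.Tensor:
    out = x.new_zeros(x.shape[0], lora_B.shape[1])
    for a in adapter_ids.unique():
        rows = (adapter_ids == a).nonzero(as_tuple=True)[0]  # segment for adapter a
        seg = x[rows]                                         # gather the segment
        out[rows] = seg @ lora_A[a].T @ lora_B[a].T           # low-rank update
    return out

# Usage: 6 mixed requests served by 3 different LoRA adapters in one call.
x = torch.randn(6, 64)
A = torch.randn(3, 8, 64) * 0.01
B = torch.randn(3, 128, 8) * 0.01
ids = torch.tensor([0, 2, 1, 0, 2, 2])
print(sgmv_reference(x, A, B, ids).shape)  # torch.Size([6, 128])
```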