Arrow Research

Author name cluster

Ke Chen 0005

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

2 papers
1 author row

Possible papers (2)

ICML 2025 · Conference Paper

FloE: On-the-Fly MoE Inference on Memory-constrained GPU

  • Yuxin Zhou
  • Zheng Li 0006
  • Jun Zhang 0069
  • Jue Wang 0019
  • Yiping Wang 0003
  • Zhongle Xie
  • Ke Chen 0005
  • Lidan Shou

With the widespread adoption of Mixture-of-Experts (MoE) models, there is a growing demand for efficient inference on memory-constrained devices. While offloading expert parameters to CPU memory and loading activated experts on demand has emerged as a potential solution, the large size of activated experts overburdens the limited PCIe bandwidth, hindering its effectiveness in latency-sensitive scenarios. To mitigate this, we propose FloE, an on-the-fly MoE inference system on memory-constrained GPUs. FloE is built on the insight that there exists substantial untapped redundancy within sparsely activated experts. It employs various compression techniques on the expert's internal parameter matrices to reduce the data movement load, combined with low-cost sparse prediction, achieving perceptible inference acceleration in wall-clock time on resource-constrained devices. Empirically, FloE achieves a 9.3× compression of parameters per expert in Mixtral-8×7B; enables deployment on a GPU with only 11 GB VRAM, reducing the memory footprint by up to 8.5×; and delivers a 48.7× inference speedup compared to DeepSpeed-MII on a single GeForce RTX 3090, all with only a 4.4%–7.6% average performance degradation.
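
The dataflow the abstract describes (keep expert weights compressed in CPU memory, fetch and decompress only the experts the router activates) can be sketched in a few dozen lines of PyTorch. The snippet below is a minimal illustration, not FloE's implementation: it substitutes plain per-tensor int8 quantization for FloE's compression pipeline, omits the low-cost sparse prediction entirely, and the class and method names (OffloadedExpert, OffloadedMoE, load_and_run) are invented for this example.

```python
# Toy sketch of on-demand expert offloading for MoE inference.
# NOT the FloE codebase; names and the int8 compression are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class OffloadedExpert(nn.Module):
    """A feed-forward expert whose weights stay compressed in CPU memory until activated."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        w1 = torch.randn(d_ff, d_model) * 0.02
        w2 = torch.randn(d_model, d_ff) * 0.02
        # Keep int8-quantized copies on the CPU so only activated experts cross the PCIe link.
        self.scale1, self.q1 = self._quantize(w1)
        self.scale2, self.q2 = self._quantize(w2)

    @staticmethod
    def _quantize(w: torch.Tensor):
        scale = w.abs().max() / 127.0
        q = torch.clamp((w / scale).round(), -128, 127).to(torch.int8)
        if torch.cuda.is_available():
            q = q.pin_memory()  # pinned memory speeds up host-to-device copies
        return scale, q

    def load_and_run(self, x: torch.Tensor) -> torch.Tensor:
        # Transfer only this expert's compressed weights, dequantize on device, compute.
        w1 = self.q1.to(x.device, non_blocking=True).float() * self.scale1
        w2 = self.q2.to(x.device, non_blocking=True).float() * self.scale2
        return F.linear(F.silu(F.linear(x, w1)), w2)


class OffloadedMoE(nn.Module):
    """Toy top-k gated MoE layer whose experts are fetched on demand."""

    def __init__(self, d_model: int = 64, d_ff: int = 256, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)
        self.experts = [OffloadedExpert(d_model, d_ff) for _ in range(n_experts)]
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights, idx = torch.topk(self.gate(x).softmax(dim=-1), self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for e in idx.unique().tolist():                 # fetch each activated expert once
            token_mask = (idx == e).any(dim=-1)
            gate_w = (weights * (idx == e)).sum(dim=-1, keepdim=True)[token_mask]
            out[token_mask] += gate_w * self.experts[e].load_and_run(x[token_mask])
        return out


if __name__ == "__main__":
    layer = OffloadedMoE()
    tokens = torch.randn(16, 64)                        # 16 tokens, d_model = 64
    print(layer(tokens).shape)                          # torch.Size([16, 64])
```

In a real system the weight transfers would overlap with computation (e.g., on a separate CUDA stream) and the activated experts would be predicted ahead of time; here they are fetched synchronously for clarity.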

ICLR 2025 · Conference Paper

Train Small, Infer Large: Memory-Efficient LoRA Training for Large Language Models

  • Jun Zhang 0069
  • Jue Wang 0019
  • Huan Li 0003
  • Lidan Shou
  • Ke Chen 0005
  • Yang You
  • Guiming Xie
  • Xuejian Gong

Large Language Models (LLMs) have significantly advanced natural language processing with exceptional task generalization capabilities. Low-Rank Adaptation (LoRA) offers a cost-effective fine-tuning solution, freezing the original model parameters and training only lightweight, low-rank adapter matrices. However, the memory footprint of LoRA is largely dominated by the original model parameters. To mitigate this, we propose LoRAM, a memory-efficient LoRA training scheme founded on the intuition that many neurons in over-parameterized LLMs have low training utility but are essential for inference. LoRAM presents a unique twist: it trains on a pruned (small) model to obtain pruned low-rank matrices, which are then recovered and utilized with the original (large) model for inference. Additionally, minimal-cost continual pre-training, performed by the model publishers in advance, aligns the knowledge discrepancy between the pruned and original models. Our extensive experiments demonstrate the efficacy of LoRAM across various pruning strategies and downstream tasks. For a model with 70 billion parameters, LoRAM enables training on a GPU with only 20 GB of HBM, replacing an A100-80G GPU for LoRA training and 15 GPUs for full fine-tuning. Specifically, QLoRAM, which combines structured pruning with 4-bit quantization, reduces the parameter storage cost that dominates memory usage in low-rank matrix training for LLaMA-3.1-70B (LLaMA-2-70B) by 15.81× (16.95×), while achieving dominant performance gains over both the original LLaMA-3.1-70B (LLaMA-2-70B) and LoRA-trained LLaMA-3.1-8B (LLaMA-2-13B). Code is available at https://github.com/junzhang-zj/LoRAM.
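
Reduced to a single weight matrix, the train-on-pruned / recover-for-full workflow can be sketched as below. This is an illustrative toy under simplifying assumptions, not LoRAM's actual procedure or API: pruning here is row selection by weight norm, recovery is zero-filling at the pruned positions, and the helper names (prune_rows, LoRALinear) are invented for the example; see the linked repository for the real implementation.

```python
# Toy sketch of "train small, infer large" for one linear layer.
# NOT LoRAM's code; pruning criterion, recovery rule, and names are assumptions.
import torch
import torch.nn as nn


def prune_rows(weight: torch.Tensor, keep_ratio: float):
    """Keep the output rows with the largest L2 norm; return the pruned weight and kept indices."""
    n_keep = max(1, int(weight.shape[0] * keep_ratio))
    kept = torch.topk(weight.norm(dim=1), n_keep).indices.sort().values
    return weight[kept], kept


class LoRALinear(nn.Module):
    """A frozen base weight plus a trainable low-rank update B @ A."""

    def __init__(self, weight: torch.Tensor, rank: int = 8):
        super().__init__()
        out_f, in_f = weight.shape
        self.weight = nn.Parameter(weight.clone(), requires_grad=False)  # frozen base
        self.lora_A = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_f, rank))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ (self.weight + self.lora_B @ self.lora_A).T


torch.manual_seed(0)

# 1) Prune the large weight and train LoRA on the resulting small model.
full_weight = torch.randn(1024, 512)             # stands in for one weight of the large LLM
small_weight, kept = prune_rows(full_weight, keep_ratio=0.5)
small_layer = LoRALinear(small_weight)           # memory-light training target
# ... run an ordinary LoRA fine-tuning loop on small_layer here ...

# 2) Recover: scatter the trained low-rank factors back to the full output
#    dimension (zeros at pruned positions) and attach them to the original weight.
rank = small_layer.lora_B.shape[1]
full_B = torch.zeros(full_weight.shape[0], rank)
full_B[kept] = small_layer.lora_B.detach()
large_layer = LoRALinear(full_weight)
large_layer.lora_A.data.copy_(small_layer.lora_A.detach())
large_layer.lora_B.data.copy_(full_B)
# large_layer now runs inference with the original (large) weights plus the recovered update.
```

The point of the construction is that only the pruned base weight and its adapters have to fit in GPU memory during training; the full-size weight is needed only at inference time.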