
AAAI 2026

MoLoRA: Boosting LLM-based End-to-end Speech Translation with Mixture of Low-rank Experts

Conference Paper · AAAI Technical Track on Natural Language Processing VI · Artificial Intelligence

Abstract

Recently, End-to-End Speech Translation (E2E-ST) methods leveraging large language models (LLMs) have demonstrated strong generalization and excellent scalability by integrating pre-trained speech encoders with LLMs, where Low-Rank Adaptation (LoRA) is commonly used for parameter-efficient fine-tuning to reduce training costs. However, LoRA's low-rank assumption often fails in multilingual tasks, as the inherent complexity of cross-lingual semantic relationships and syntactic variations exceeds the representational capacity of low-rank matrices. This leads to parameter conflicts across languages and, in turn, suboptimal performance. To address this issue, we propose Mixture of Low-Rank Adaptations (MoLoRA), which integrates the Mixture of Experts (MoE) mechanism with LoRA. MoLoRA effectively enhances the model's expressive capacity while maintaining parameter efficiency during training. Specifically, we treat multiple LoRA modules as low-rank experts and introduce a routing mechanism that dynamically activates language-specific experts. Additionally, shared experts are incorporated and consistently activated to model cross-lingual general knowledge. Furthermore, to enhance the robustness and accuracy of speech representations, we propose a Multi-Granularity Representation Fusion (MGRF) module. By fusing frame-level and sentence-level features, this module mitigates noise-induced local distortions in frame-level speech representations, thereby providing the LLM with more accurate high-level semantic information. We conduct multilingual experiments on the MuST-C and CoVoST-2 datasets. Our method achieves an average BLEU score of 32.2 across eight language pairs on MuST-C and 36.3 across three language pairs on CoVoST-2, establishing a new state-of-the-art (SOTA).
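The core MoLoRA idea described above can be sketched in a few lines: several LoRA adapters act as low-rank experts, a router softmax-weights and top-k-selects language-specific experts per input, and shared experts are always applied. The sketch below is a minimal, dependency-free illustration under assumed settings; the dimensions (`D`, `R`), expert counts, top-k value, class names, and the weighted-sum combination are all illustrative assumptions, not the paper's actual configuration.

```python
import math
import random

random.seed(0)

D, R = 8, 2              # hidden size and LoRA rank (illustrative)
N_ROUTED, N_SHARED, TOP_K = 4, 1, 2

def rand_mat(rows, cols):
    return [[random.gauss(0, 0.02) for _ in range(cols)] for _ in range(rows)]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(z):
    mx = max(z)
    e = [math.exp(v - mx) for v in z]
    s = sum(e)
    return [v / s for v in e]

class LoRAExpert:
    """One low-rank adapter: delta(x) = B @ (A @ x), rank R << D."""
    def __init__(self):
        self.A = rand_mat(R, D)   # down-projection
        self.B = rand_mat(D, R)   # up-projection

    def __call__(self, x):
        return matvec(self.B, matvec(self.A, x))

class MoLoRALayer:
    """Routed low-rank experts plus always-active shared experts (a sketch)."""
    def __init__(self):
        self.routed = [LoRAExpert() for _ in range(N_ROUTED)]
        self.shared = [LoRAExpert() for _ in range(N_SHARED)]
        self.gate = rand_mat(N_ROUTED, D)   # router weights

    def __call__(self, x):
        probs = softmax(matvec(self.gate, x))
        top = sorted(range(N_ROUTED), key=lambda i: -probs[i])[:TOP_K]
        delta = [0.0] * D
        for i in top:                       # dynamically selected experts
            d = self.routed[i](x)
            delta = [a + probs[i] * b for a, b in zip(delta, d)]
        for e in self.shared:               # cross-lingual shared knowledge
            d = e(x)
            delta = [a + b for a, b in zip(delta, d)]
        return [xi + di for xi, di in zip(x, delta)]

layer = MoLoRALayer()
x = [random.gauss(0, 1) for _ in range(D)]
y = layer(x)                # frozen base output would be added in practice
```

In a real model the router would be trained jointly with the adapters (often with a load-balancing loss), and the frozen base layer's output replaces the identity term here.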
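The MGRF idea, fusing frame-level features with a sentence-level summary to suppress noisy frames, can also be illustrated with a toy sketch. The abstract does not specify the fusion mechanism, so the mean-pooled sentence vector, the scalar sigmoid gate, and all names and dimensions below are assumptions chosen for brevity, not the paper's actual module.

```python
import math
import random

random.seed(1)

D = 4  # feature dimension (illustrative)

def mean_pool(frames):
    """Sentence-level feature as the mean of all frame features."""
    T = len(frames)
    return [sum(f[j] for f in frames) / T for j in range(D)]

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def fuse(frames, w_gate):
    """Blend each frame with the sentence-level vector via a learned
    scalar gate; frames far from the global summary lean on it more.
    One plausible instantiation, not the paper's exact design."""
    s = mean_pool(frames)
    fused = []
    for f in frames:
        g = sigmoid(sum(w * (fi - si) for w, fi, si in zip(w_gate, f, s)))
        fused.append([g * fi + (1 - g) * si for fi, si in zip(f, s)])
    return fused

frames = [[random.gauss(0, 1) for _ in range(D)] for _ in range(5)]
w_gate = [random.gauss(0, 0.5) for _ in range(D)]
fused = fuse(frames, w_gate)
```

Because each fused frame is a convex combination of the original frame and the pooled sentence vector, an isolated noisy frame is pulled toward the sentence-level consensus, which is the distortion-mitigation effect the abstract describes.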

Authors

Keywords

No keywords are indexed for this paper.

Context

Venue
AAAI Conference on Artificial Intelligence
Archive span
1980-2026
Indexed papers
28718
Paper id
903895718299866847