CommitMoE: Efficient Fallback-Free MoE Inference with Offloading Under GPU Memory Constraints

Han Li; Jingwei Sun; Junqing Lin; Guangzhong Sun

doi:10.1609/aaai.v40i27.39454


AAAI 2026


Conference Paper; AAAI Technical Track on Machine Learning IV; Artificial Intelligence


Abstract

Mixture of Experts (MoE) models have emerged as a promising approach to scaling language models efficiently by activating only a subset of parameters for each input. However, deploying these models under GPU memory constraints remains challenging, because existing offloading strategies incur significant overhead from CPU-GPU data transfers. Prior work has explored prefetching to mitigate this bottleneck, but these methods require costly fallback mechanisms when predictions fail: because an expert transfer cannot be canceled once initiated, the correct experts must then be loaded on demand, one after another, which adds latency. To address this, we present CommitMoE, a novel approach featuring a Commit Router that makes execution decisions based on expert predictions without fallback mechanisms. Our key insight is that router certainty strongly correlates with prediction accuracy, while in low-certainty scenarios the model output is inherently robust to expert selection. Building this insight into a systems-level design, CommitMoE achieves 1.3× to 9.4× faster inference than state-of-the-art offloading frameworks across different environments and datasets while maintaining model quality.
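The latency argument in the abstract can be made concrete with a toy cost model. The sketch below is illustrative only and is not the authors' implementation: the function names, the per-layer `compute_ms` and `load_ms` values, and the assumption that prefetched experts arrive fully overlapped with earlier computation are all hypothetical. It compares a fallback scheme, which loads each mispredicted expert on demand and sequentially, with a commit scheme, which runs the predicted experts and never reloads.

```python
def fallback_latency_ms(predicted, actual, compute_ms, load_ms):
    """Latency of one layer when mispredicted experts are loaded on demand.

    Assumption: prefetched experts arrive during earlier compute, so only
    the experts that were needed but not predicted add transfer time.
    """
    missed = set(actual) - set(predicted)
    # Missed experts are fetched one after another before the layer runs.
    return compute_ms + load_ms * len(missed)


def commit_latency_ms(predicted, actual, compute_ms, load_ms):
    """Latency of one layer when the predicted experts are executed as-is.

    `actual` and `load_ms` are unused: no on-demand transfer ever happens.
    """
    return compute_ms


def total_latency_ms(layers, layer_fn, compute_ms, load_ms):
    """Sum a per-layer latency function over (predicted, actual) pairs."""
    return sum(layer_fn(p, a, compute_ms, load_ms) for p, a in layers)


# Hypothetical trace: (predicted experts, experts the router actually picks).
layers = [
    ((0, 1), (0, 1)),  # both predictions correct
    ((2, 3), (2, 5)),  # one misprediction
    ((4, 6), (1, 7)),  # two mispredictions
]

# Made-up costs for illustration; not measurements from the paper.
fallback = total_latency_ms(layers, fallback_latency_ms, compute_ms=2.0, load_ms=10.0)
commit = total_latency_ms(layers, commit_latency_ms, compute_ms=2.0, load_ms=10.0)
print(fallback, commit)  # 36.0 6.0
```

The model captures only the latency side of the trade-off. Whether committing to a mispredicted expert preserves output quality is the empirical claim the abstract makes about low-certainty routing, and this sketch does not test it.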
