Arrow Research search

Author name cluster

Wenjun Ke

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

17 papers
1 author row

Possible papers (17)

AAAI 2026 · Conference Paper

Balanced Knowledge Distillation for Large Language Models with Mix-of-Experts

  • Jiajun Liu
  • Yao He
  • Wenjun Ke
  • Peng Wang
  • Ziyu Shang
  • Guozheng Li
  • Zijie Xu

Mixture-of-Experts (MoE) architectures have recently become a more prevalent choice for large language models (LLMs) than dense architectures due to their superior performance. However, their billions of parameters impose a huge deployment and inference cost on MoE LLMs. To address these issues, knowledge distillation (KD) has become a widely adopted technique to compress LLMs. Existing KD methods for LLMs can be divided into dense-to-dense and MoE-to-dense distillation. Dense-to-dense distillation transfers knowledge between single dense LLMs, while MoE-to-dense distillation attempts to transfer knowledge from MoE LLMs to dense LLMs. However, the architectural mismatch prevents the student from fully absorbing knowledge when distilling MoE LLMs. To address this limitation, we investigate a new distillation setting, MoE-to-MoE, which aims to fully leverage the expert knowledge of teachers and enable the student to absorb it more effectively. Compared to dense-to-dense and MoE-to-dense, MoE-to-MoE suffers from two imbalance issues. First, expert-coverage deficiency reflects an imbalanced knowledge transfer of teacher experts: traditional distillation utilizes only the few experts activated by the teacher router. Second, routing imbalance appears when the student routing distribution drifts from the teacher's, which makes it difficult for the student to learn how to distribute inputs across experts. To overcome these issues, we propose a novel distillation framework for MoE-to-MoE, Balanced Distillation (B-Distill), which equally spreads teacher expertise across student experts while regularizing the student router toward teacher-consistent balance. First, to mitigate expert-coverage deficiency, we introduce Monte Carlo exploration, which stochastically perturbs router probabilities so every teacher and student expert is sampled without enlarging the search space. Second, to correct routing imbalance and avert load collapse, we propose an entropy-aware router distillation mechanism that aligns the student router with the teacher while curbing over-concentration. Experiments show that B-Distill outperforms baselines by up to 6.6% in ROUGE-L.
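
As a rough illustration of the two mechanisms named in the abstract, the sketch below pairs a stochastic router perturbation (Monte Carlo exploration) with a KL router-alignment loss carrying an entropy bonus. Every name and shape here is a hypothetical toy, not the authors' B-Distill implementation.

```python
# Minimal sketch of the two losses described above (not the authors' code).
import torch
import torch.nn.functional as F

def perturbed_routing(router_logits, noise_scale=0.5):
    """Stochastically perturb router probabilities (Monte Carlo exploration)
    so that rarely-activated experts still receive distillation signal."""
    noise = noise_scale * torch.randn_like(router_logits)
    return F.softmax(router_logits + noise, dim=-1)

def router_distill_loss(student_logits, teacher_probs, entropy_weight=0.1):
    """Align the student router with the teacher (KL term); subtracting an
    entropy bonus discourages over-concentration on a few experts."""
    log_p_student = F.log_softmax(student_logits, dim=-1)
    kl = F.kl_div(log_p_student, teacher_probs, reduction="batchmean")
    p_student = log_p_student.exp()
    entropy = -(p_student * log_p_student).sum(dim=-1).mean()
    return kl - entropy_weight * entropy

# Toy usage: 4 experts on both sides for simplicity (assumed shapes).
t_logits = torch.randn(16, 4)                      # teacher router logits per token
s_logits = torch.randn(16, 4, requires_grad=True)  # student router logits
loss = router_distill_loss(s_logits, perturbed_routing(t_logits))
loss.backward()
```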

AAAI 2026 · Conference Paper

Benchmarking and Enhancing Rule Knowledge-Driven Reasoning of Large Language Models

  • Zijie Xu
  • Wenjun Ke
  • Peng Wang
  • Guozheng Li
  • Qingjian Ni
  • Jiajun Liu
  • Ziyu Shang
  • Jing Zhou

Large Language Models (LLMs) have demonstrated strong capabilities across diverse tasks under the example-driven learning paradigm. However, in high-stakes domains such as emergency response and industrial safety, historical incidents are scarce, confidential, or both, while concise rule books are abundant. We formalize this underexplored setting as rule knowledge-driven reasoning and ask: Can LLMs reason reliably when rules are plentiful but examples are nearly absent? To study this question, we introduce RULER, an automatic benchmark that generates 32K rigorously verified questions from 1K expert-curated emergency response rules to probe three core abilities: rule memorization, single-rule application, and multi-rule complex reasoning. RULER is further equipped with a hallucination-aware evaluation suite and novel relational metrics. A comprehensive empirical study of five representative LLMs and five enhancement strategies shows that, even when models achieve reliable performance on rule memorization and single-rule application, multi-rule complex reasoning plateaus at 5.4 on a 10-point scale. To address this limitation, we propose RAMPS, a Rule knowledge-Aware Monte Carlo Tree Search Process-reward Supervision framework. RAMPS injects rule knowledge priors into MCTS, distills 12K step-level traces without human annotation, and trains an advantage-based reward model that scores candidate reasoning paths during beam search inference. Experimental results show that RAMPS significantly improves multi-rule complex reasoning performance to 7.7.
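
For readers unfamiliar with process-reward-guided inference, here is a minimal sketch of beam search scored by a step-level reward model, the inference pattern the abstract describes; the proposer and reward model below are trivial stand-ins for an LLM and RAMPS's trained advantage-based scorer.

```python
# Hedged sketch of process-reward-guided beam search (hypothetical scorer;
# not the released RAMPS implementation).
from dataclasses import dataclass, field

@dataclass
class Beam:
    steps: list = field(default_factory=list)
    score: float = 0.0

def beam_search(expand, reward_model, width=3, depth=4):
    """Keep the `width` best partial reasoning paths, scoring each new
    step with a step-level (process) reward instead of only a final answer."""
    beams = [Beam()]
    for _ in range(depth):
        candidates = []
        for b in beams:
            for step in expand(b.steps):          # candidate next steps
                r = reward_model(b.steps, step)   # step-level reward
                candidates.append(Beam(b.steps + [step], b.score + r))
        beams = sorted(candidates, key=lambda b: -b.score)[:width]
    return beams[0]

# Toy components standing in for an LLM proposer and a trained reward model.
expand = lambda steps: [f"step{len(steps)}-{i}" for i in range(2)]
reward_model = lambda steps, step: 1.0 if step.endswith("0") else 0.5
print(beam_search(expand, reward_model).steps)
```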

AAAI 2026 · Conference Paper

Optimizing LoRA Allocation of MoE with the Alignment of Topic Correlation

  • Hengyuan Xu
  • Wenjun Ke
  • Yao He
  • Jiajun Liu
  • Dong Nie
  • Peng Wang
  • Ziyu Shang
  • Zijie Xu

Mixture of experts (MoE) dynamically routes inputs to specialized expert networks to scale model capacity with low inference overhead. However, the excessive parameter growth in MoE models poses challenges in low-resource settings. To address this issue, MoE combined with parameter-efficient fine-tuning (PEFT) has emerged as a lightweight adaptation paradigm that distributes knowledge among experts via multiple LoRA blocks. Existing MoE-PEFT methods can be broadly categorized into external and internal PEFT methods. External PEFT methods incorporate lightweight models into existing MoE architectures without modifying their routing, which limits parameter efficiency. To overcome this, internal PEFT methods integrate MoE architectures into PEFT, enabling minimal parameter overhead. However, they still face two major challenges: (1) a lack of expert functional differentiation, resulting in overlapping specialization across modules, and (2) the absence of a structured attribution mechanism to guide expert selection based on semantic relevance. To alleviate these challenges, we propose TopicLoRA, a novel three-stage framework that leverages topic knowledge as semantic anchors to guide expert allocation. Specifically, (1) to address expert redundancy, we construct a topic-level prior graph using Graph Neural Network-enhanced representation learning over Big-Bench categories, enforcing structural separation among expert embeddings, and (2) to introduce semantic attribution, we design a dual-loss training mechanism that softly aligns input-query relevance with topic-guided routing distributions via KL divergence. Extensive experiments on representative datasets (e.g., MMLU, GSM8K, Flanv2) demonstrate that TopicLoRA outperforms state-of-the-art PEFT baselines by 2.40% on average in accuracy, with a maximum improvement of 4.21%. Furthermore, ablation studies demonstrate our framework's robustness to intricate topics and input-sequence variations, which stems from the dual-loss training mechanism.
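
The dual-loss idea, a task loss plus a KL term softly aligning routing with a topic prior, can be sketched in a few lines; all tensors and weights below are illustrative assumptions, not TopicLoRA's actual three-stage pipeline.

```python
# Illustrative sketch of the dual-loss idea (task loss + KL aligning the
# LoRA-expert routing distribution with a topic-relevance prior).
import torch
import torch.nn.functional as F

def dual_loss(task_loss, router_logits, topic_relevance, kl_weight=0.5):
    """Softly align input-conditioned routing with topic-guided targets."""
    log_route = F.log_softmax(router_logits, dim=-1)
    topic_prior = F.softmax(topic_relevance, dim=-1)   # semantic anchor
    kl = F.kl_div(log_route, topic_prior, reduction="batchmean")
    return task_loss + kl_weight * kl

router_logits = torch.randn(8, 6, requires_grad=True)  # 6 LoRA experts (toy)
topic_relevance = torch.randn(8, 6)                    # query-topic scores (toy)
loss = dual_loss(torch.tensor(1.2), router_logits, topic_relevance)
loss.backward()
```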

AAAI 2024 · Conference Paper

ConsistNER: Towards Instructive NER Demonstrations for LLMs with the Consistency of Ontology and Context

  • Chenxiao Wu
  • Wenjun Ke
  • Peng Wang
  • Zhizhao Luo
  • Guozheng Li
  • Wanyi Chen

Named entity recognition (NER) aims to identify and classify specific entities mentioned in textual sentences. Most existing high-performing NER models employ the standard fully supervised paradigm, which requires a large amount of annotated data during training. In order to maintain performance with insufficient annotation resources (i.e., low resources), in-context learning (ICL) has drawn a lot of attention due to its plug-and-play nature compared to other methods (e.g., meta-learning and prompt learning). In this paradigm, retrieving highly correlated demonstrations for target sentences is the key to eliciting ICL ability. For the NER task, this correlation implies the consistency of both ontology (i.e., generalized entity type) and context (i.e., sentence semantics), which is ignored by previous NER demonstration retrieval techniques. To address this issue, we propose ConsistNER, a novel three-stage framework that incorporates ontological and contextual information for low-resource NER. First, ConsistNER employs large language models (LLMs) to pre-recognize potential entities in a zero-shot manner. Second, ConsistNER retrieves sentence-specific demonstrations for each target sentence based on the following two considerations: (1) regarding ontological consistency, demonstrations are filtered into a candidate set based on ontology distribution; (2) regarding contextual consistency, an entity-aware self-attention mechanism is introduced to focus more on the potential entities and semantically correlated tokens. Finally, ConsistNER feeds the retrieved demonstrations for all target sentences into LLMs for prediction. We conduct experiments on four widely adopted NER datasets, covering both general and specific domains. Experimental results show that ConsistNER achieves a 6.01%-26.37% and 3.07%-21.18% improvement over the state-of-the-art baselines on Micro-F1 scores under 1- and 5-shot settings, respectively.
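
A minimal sketch of the two-stage retrieval described above: filter candidates by ontological overlap, then rank by contextual similarity. The toy encoder and overlap threshold are assumptions; the paper's entity-aware self-attention is approximated here by a plain embedding dot product.

```python
# Hedged sketch of consistency-based demonstration retrieval.
import numpy as np

def ontology_overlap(types_a, types_b):
    """Jaccard overlap of (pre-recognized) entity-type sets."""
    a, b = set(types_a), set(types_b)
    return len(a & b) / max(len(a | b), 1)

def retrieve(target, pool, embed, k=5, min_overlap=0.5):
    # Stage 1: filter candidates by ontological consistency.
    cand = [d for d in pool
            if ontology_overlap(target["types"], d["types"]) >= min_overlap]
    # Stage 2: rank by contextual similarity of sentence embeddings.
    t = embed(target["text"])
    cand.sort(key=lambda d: -float(np.dot(t, embed(d["text"]))))
    return cand[:k]

embed = lambda s: np.ones(4) * len(s) / 100.0   # stand-in sentence encoder
pool = [{"text": "Paris is in France", "types": ["LOC"]},
        {"text": "Apple hired Tim", "types": ["ORG", "PER"]}]
print(retrieve({"text": "Berlin lies in Germany", "types": ["LOC"]},
               pool, embed, k=1))
```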

IJCAI 2024 · Conference Paper

Domain-Hierarchy Adaptation via Chain of Iterative Reasoning for Few-shot Hierarchical Text Classification

  • Ke Ji
  • Peng Wang
  • Wenjun Ke
  • Guozheng Li
  • Jiajun Liu
  • Jingsheng Gao
  • Ziyu Shang

Recently, various pre-trained language models (PLMs) have demonstrated impressive performance on a wide range of few-shot tasks. However, limited by the unstructured prior knowledge in PLMs, it is difficult to maintain consistent performance on complex, hierarchically dependent tasks, especially when the downstream data is extremely scarce. The main challenge is how to transfer the unstructured semantic space in PLMs to the downstream domain hierarchy. Unlike previous work on hierarchical text classification (HTC), which directly performs multi-label classification or uses graph neural networks (GNNs) to inject the label hierarchy, in this work we study the HTC problem under a few-shot setting to adapt knowledge in PLMs from an unstructured manner to the downstream hierarchy. Technically, we design a simple yet effective method named Hierarchical Iterative Conditional Random Field (HierICRF), which searches the most domain-challenging directions and crafts domain-hierarchy adaptation as a hierarchical iterative language modeling problem; it then encourages the model to perform hierarchical consistency self-correction during inference, thereby achieving knowledge transfer with hierarchical consistency preservation. We apply HierICRF to various architectures, and extensive experiments on two popular HTC datasets demonstrate that prompting with HierICRF significantly boosts few-shot HTC performance, with average Micro-F1 improvements of 28.80% to 1.50% and Macro-F1 improvements of 36.29% to 1.50% over the previous state-of-the-art (SOTA) baselines under few-shot settings (1->16), while retaining SOTA hierarchical consistency performance.
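
One simple way to picture "hierarchical consistency self-correction" is completing missing ancestors along a predicted label path; the toy below does only that and is far simpler than the iterative CRF the paper actually trains.

```python
# Toy illustration: repair a prediction so every child label's ancestors
# are present (one simplified reading of hierarchical consistency; not
# the HierICRF model). The label hierarchy here is assumed.
parent = {"sports.football": "sports", "sports": None,
          "politics.election": "politics", "politics": None}

def enforce_hierarchy(labels):
    fixed = set(labels)
    for lab in list(fixed):
        p = parent.get(lab)
        while p:                      # walk up and add missing ancestors
            fixed.add(p)
            p = parent.get(p)
    return sorted(fixed)

print(enforce_hierarchy(["sports.football"]))  # ['sports', 'sports.football']
```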

IJCAI 2024 · Conference Paper

Fast and Continual Knowledge Graph Embedding via Incremental LoRA

  • Jiajun Liu
  • Wenjun Ke
  • Peng Wang
  • Jiahao Wang
  • Jinhua Gao
  • Ziyu Shang
  • Guozheng Li
  • Zijie Xu

Continual Knowledge Graph Embedding (CKGE) aims to efficiently learn new knowledge and simultaneously preserve old knowledge. Dominant approaches primarily focus on alleviating catastrophic forgetting of old knowledge but neglect efficient learning for the emergence of new knowledge. However, in real-world scenarios, knowledge graphs (KGs) are continuously growing, which brings a significant challenge to fine-tuning KGE models efficiently. To address this issue, we propose a fast CKGE framework (FastKGE), incorporating an incremental low-rank adapter (IncLoRA) mechanism to efficiently acquire new knowledge while preserving old knowledge. Specifically, to mitigate catastrophic forgetting, FastKGE isolates and allocates new knowledge to specific layers based on the fine-grained influence between old and new KGs. Subsequently, to accelerate fine-tuning, FastKGE devises an efficient IncLoRA mechanism, which embeds the specific layers into incremental low-rank adapters with fewer training parameters. Moreover, IncLoRA introduces adaptive rank allocation, which makes the LoRA aware of the importance of entities and adjusts its rank scale adaptively. We conduct experiments on four public datasets and two new datasets with a larger initial scale. Experimental results demonstrate that FastKGE can reduce training time by 34%-49% while still achieving competitive link prediction performance against state-of-the-art models on four public datasets (average MRR score of 21.0% vs. 21.1%). Meanwhile, on two newly constructed datasets, FastKGE saves 51%-68% training time and improves link prediction performance by 1.5%.
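
A hedged sketch of the core IncLoRA idea, learning new knowledge in a small low-rank adapter over frozen old embeddings, with the rank chosen adaptively from an importance score; the scoring rule and shapes below are hypothetical.

```python
# Sketch of an incremental low-rank adapter for entity embeddings: new
# knowledge lives in a low-rank delta while old embeddings stay frozen.
import torch

class IncLoRA(torch.nn.Module):
    def __init__(self, num_entities, dim, rank):
        super().__init__()
        self.A = torch.nn.Parameter(torch.randn(num_entities, rank) * 0.01)
        self.B = torch.nn.Parameter(torch.zeros(rank, dim))

    def forward(self, frozen_emb, idx):
        # frozen embedding + trainable low-rank correction
        return frozen_emb[idx] + self.A[idx] @ self.B

frozen = torch.randn(1000, 128)      # old entity embeddings (not trained)
importance = 0.7                     # assumed influence score of new KG on old KG
rank = max(2, int(16 * importance))  # adaptive rank allocation (toy rule)
adapter = IncLoRA(1000, 128, rank)
out = adapter(frozen, torch.tensor([3, 42]))
```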

IJCAI 2024 · Conference Paper

Incorporating Schema-Aware Description into Document-Level Event Extraction

  • Zijie Xu
  • Peng Wang
  • Wenjun Ke
  • Guozheng Li
  • Jiajun Liu
  • Ke Ji
  • Xiye Chen
  • Chenxiao Wu

Document-level event extraction (DEE) aims to extract the structured event information from a given document, facing two critical challenges: (1) event arguments always scatter across sentences (arguments-scattering); (2) multiple events can co-occur in one document (multi-event). Most recent studies mainly follow two simplified settings to ease the challenges: one simplifies DEE with the no-trigger-words design (NDEE), and the other focuses on event argument extraction (DEAE), a sub-task of DEE. However, the former excludes trigger extraction and suffers from error propagation in the sub-tasks. The latter relies heavily on the gold triggers as prerequisites and struggles to distinguish multiple arguments playing the same role in different events. To address the limitations above, we propose a novel joint trigger and argument extraction paradigm SEELE to enhance the DEE model via incorporating SchEma-awarE descriptions into Document-Level Event extraction. Specifically, the schema-aware descriptions are leveraged from two aspects: (1) guiding the attention mechanism among event-aware tokens across sentences, which relieves arguments-scattering without error propagation; (2) performing the fine-grained contrastive learning to distinguish different events, which mitigates multi-event without gold triggers. Extensive experiments show the superiority of SEELE, achieving notable improvements (2.1% to 9.7% F1) on three NDEE datasets and competitive performance on two DEAE datasets. Our code is available at https://github.com/TheoryRhapsody/SEELE.
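
The first use of schema-aware descriptions, biasing attention toward event-aware tokens across sentences, might look roughly like the toy below; the bias term and mask are assumptions, not the SEELE architecture.

```python
# Toy sketch: additively bias attention scores toward tokens flagged as
# event-aware by a schema description (illustrative only).
import torch

def schema_biased_attention(q, k, v, event_mask, bias=2.0):
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
    scores = scores + bias * event_mask.unsqueeze(-2)  # boost event-aware keys
    return torch.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(1, 6, 16)                      # (batch, tokens, dim)
event_mask = torch.tensor([[0., 1., 0., 0., 1., 0.]])  # schema-matched tokens
out = schema_biased_attention(q, k, v, event_mask)
```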

IJCAI 2024 · Conference Paper

Learning Multi-Granularity and Adaptive Representation for Knowledge Graph Reasoning

  • Ziyu Shang
  • Peng Wang
  • Wenjun Ke
  • Jiajun Liu
  • Hailang Huang
  • Guozheng Li
  • Chenxiao Wu
  • Jianghan Liu

Knowledge graph reasoning (KGR) aims to infer new factual triples from existing knowledge graphs (KGs). Recently, a new category of methods, possessing both transductive and inductive reasoning capabilities, has been proposed to tackle this task via learning entity-independent representations from local neighboring structures. However, these methods are plagued by inefficiency issues, and they exclusively capture evidence from well-designed local structures, ignoring the correlation between the query and different structures within KGs. In this work, we first propose a novel multi-granularity and adaptive representation framework, MulGA, exploiting the connectivity subgraph to uniformly and hierarchically model query-related triples, relation paths, and subgraphs without explicitly extracting any graph structure, hence mitigating inefficiency issues. Second, we introduce a message-passing mechanism across connectivity subgraphs, facilitating all entities to attain query-related structural representations of diverse granularity levels, i.e., triples and relation paths of different lengths. Third, we design a self-attention-based merging mechanism that allocates weights to different granularities and then consolidates them into subgraph-granularity representations for reasoning. Systematic experiments were conducted on 15 benchmarks; MulGA achieves a significant improvement in MRR, by an average of 1.5% on transductive and 2.7% on inductive tasks, over existing state-of-the-art methods. Moreover, MulGA boasts faster convergence speed, competitive inference time, and alleviates the over-smoothing prevalent in graph neural networks.
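
The self-attention-based merging of granularities could be sketched as a softmax-weighted combination of triple-, path-, and subgraph-level representations; the scoring below is a deliberately simplified stand-in for MulGA's mechanism.

```python
# Minimal sketch of merging multi-granularity representations with
# attention-style weights (illustrative; not MulGA itself).
import torch

def merge_granularities(reps):
    """reps: (num_granularities, dim) query-related representations,
    e.g., triple-level and path-level; returns a weighted merged rep."""
    scores = reps @ reps.mean(dim=0)      # relevance to the mean context
    weights = torch.softmax(scores, dim=0)
    return (weights.unsqueeze(-1) * reps).sum(dim=0)

reps = torch.randn(3, 32)   # e.g., triple, 2-hop path, 3-hop path granularities
print(merge_granularities(reps).shape)    # torch.Size([32])
```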

IJCAI 2024 · Conference Paper

Making LLMs as Fine-Grained Relation Extraction Data Augmentor

  • Yifan Zheng
  • Wenjun Ke
  • Qi Liu
  • Yuting Yang
  • Ruizhuo Zhao
  • Dacheng Feng
  • Jianwei Zhang
  • Zhi Fang

Relation Extraction (RE) identifies relations between entities in text, typically relying on supervised models that demand abundant high-quality data. Various approaches, including Data Augmentation (DA), have been proposed as promising solutions for addressing low-resource challenges in RE. However, existing DA methods in RE often struggle to ensure consistency and contextual diversity in generated data due to the fine-grained nature of RE. Inspired by the extensive generative capabilities of large language models (LLMs), we introduce a novel framework named ConsistRE, aiming to maintain context consistency in RE. ConsistRE initiates by collecting a substantial corpus from external resources and employing statistical algorithms and semantics to identify keyword hints closely related to relation instances. These keyword hints are subsequently integrated as contextual constraints in sentence generation, ensuring the preservation of relation dependence and diversity with LLMs. Additionally, we implement syntactic dependency selection to enhance the syntactic structure of the generated sentences. Experimental results on the SemEval, TACRED, and TACREV datasets unequivocally demonstrate that ConsistRE outperforms other baselines in F1 score by 1.76%, 3.92%, and 2.53%, respectively, particularly when operating under low-resource experimental conditions.
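
The keyword-hint constraint reduces, at its simplest, to prompt construction like the sketch below; the prompt wording is hypothetical, and the paper's statistical hint mining and syntactic dependency selection are omitted.

```python
# Hedged sketch of keyword-hint-constrained augmentation prompting
# (hypothetical prompt format, not ConsistRE's actual templates).
def build_prompt(head, tail, relation, hints):
    return (
        f"Write one sentence expressing the relation '{relation}' "
        f"between '{head}' and '{tail}'. "
        f"The sentence must contain the words: {', '.join(hints)}."
    )

print(build_prompt("Marie Curie", "physics", "field_of_work",
                   ["research", "pioneering"]))
```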

AAAI 2024 · Conference Paper

OntoFact: Unveiling Fantastic Fact-Skeleton of LLMs via Ontology-Driven Reinforcement Learning

  • Ziyu Shang
  • Wenjun Ke
  • Nana Xiu
  • Peng Wang
  • Jiajun Liu
  • Yanhui Li
  • Zhizhao Luo
  • Ke Ji

Large language models (LLMs) have demonstrated impressive proficiency in information retrieval, but they are prone to generating incorrect responses that conflict with reality, a phenomenon known as intrinsic hallucination. The critical challenge lies in the unclear and unreliable fact distribution within LLMs trained on vast amounts of data. The prevalent approach frames the factual detection task as a question-answering paradigm, where the LLMs are asked about factual knowledge and examined for correctness. However, existing studies have primarily focused on deriving test cases only from several specific domains, such as movies and sports, limiting the comprehensive observation of missing knowledge and the analysis of unexpected hallucinations. To address this issue, we propose OntoFact, an adaptive framework for detecting unknown facts of LLMs, devoted to mining the ontology-level skeleton of the missing knowledge. Specifically, we argue that LLMs could expose the ontology-based similarity among missing facts and introduce five representative knowledge graphs (KGs) as benchmarks. We further devise a sophisticated ontology-driven reinforcement learning (ORL) mechanism to produce error-prone test cases with specific entities and relations automatically. The ORL mechanism rewards navigation of the KGs toward feasible directions for unveiling factual errors. Moreover, empirical efforts demonstrate that dominant LLMs are biased towards answering Yes rather than No, regardless of whether this knowledge is included. To mitigate the overconfidence of LLMs, we leverage a hallucination-free detection (HFD) strategy to tackle unfair comparisons between baselines, thereby boosting the result robustness. Experimental results on 5 datasets, using 32 representative LLMs, reveal a general lack of factual knowledge in current LLMs. Notably, ChatGPT exhibits fact error rates of 51.6% on DBpedia and 64.7% on YAGO. Additionally, the ORL mechanism demonstrates promising error prediction scores, with F1 scores ranging from 70% to 90% across most LLMs. Compared to exhaustive testing, ORL achieves an average recall of 80% while reducing evaluation time by 35.29% to 63.12%.
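
The ontology-driven exploration can be pictured as a bandit over (class, relation) cells of the ontology, steering test generation toward error-prone regions; this epsilon-greedy toy is a simplified stand-in for the paper's ORL mechanism.

```python
# Toy sketch: treat (class, relation) ontology cells as bandit arms and
# steer test-case generation toward error-prone regions (simplified
# stand-in for ontology-driven RL; all names are hypothetical).
import random
from collections import defaultdict

value = defaultdict(float)   # running error-rate estimate per arm
count = defaultdict(int)

def pick_arm(arms, eps=0.2):
    if random.random() < eps:
        return random.choice(arms)            # explore
    return max(arms, key=lambda a: value[a])  # exploit error-prone cells

def update(arm, llm_was_wrong):
    count[arm] += 1
    value[arm] += (float(llm_was_wrong) - value[arm]) / count[arm]

arms = [("Person", "birthPlace"), ("Film", "director")]
arm = pick_arm(arms)         # choose where to generate the next test case
update(arm, llm_was_wrong=True)
```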

IJCAI 2024 · Conference Paper

Recall, Retrieve and Reason: Towards Better In-Context Relation Extraction

  • Guozheng Li
  • Peng Wang
  • Wenjun Ke
  • Yikai Guo
  • Ke Ji
  • Ziyu Shang
  • Jiajun Liu
  • Zijie Xu

Relation extraction (RE) aims to identify relations between entities mentioned in texts. Although large language models (LLMs) have demonstrated impressive in-context learning (ICL) abilities in various tasks, they still underperform most supervised fine-tuned RE methods. Utilizing ICL for RE with LLMs encounters two challenges: (1) retrieving good demonstrations from training examples, and (2) enabling LLMs to exhibit strong ICL abilities in RE. On the one hand, retrieving good demonstrations is a non-trivial process in RE, which easily results in low relevance regarding entities and relations. On the other hand, ICL with an LLM achieves poor performance in RE when RE differs from language modeling in nature or when the LLM is not large enough. In this work, we propose a novel recall-retrieve-reason RE framework that synergizes LLMs with retrieval corpora (training examples) to enable relevant retrieving and reliable in-context reasoning. Specifically, we distill consistent ontological knowledge from training datasets to let LLMs generate relevant entity pairs, grounded by the retrieval corpora, as valid queries. These entity pairs are then used to retrieve relevant training examples from the retrieval corpora as demonstrations for LLMs to conduct better ICL via instruction tuning. Extensive experiments on different LLMs and RE datasets demonstrate that our method generates relevant and valid entity pairs and boosts the ICL abilities of LLMs, achieving competitive or new state-of-the-art performance on sentence-level RE compared to previous supervised fine-tuning methods and ICL-based methods.
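
A minimal sketch of the recall-retrieve step: recalled entity pairs act as queries over the training corpus to fetch demonstrations. The recall itself is stubbed here; in the paper it is an instruction-tuned LLM grounded by the corpora.

```python
# Hedged sketch of retrieval keyed on recalled entity pairs (toy corpus
# and matching rule; not the paper's implementation).
def retrieve_demos(recalled_pairs, corpus, k=2):
    demos = []
    for head, tail in recalled_pairs:
        hits = [ex for ex in corpus
                if ex["head"] == head or ex["tail"] == tail]
        demos.extend(hits[:k])       # top-k matches per recalled pair
    return demos

corpus = [{"head": "Einstein", "tail": "physics", "rel": "field_of_work",
           "text": "Einstein worked on physics."}]
print(retrieve_demos([("Einstein", "relativity")], corpus))
```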

AAAI 2024 · Conference Paper

Towards Continual Knowledge Graph Embedding via Incremental Distillation

  • Jiajun Liu
  • Wenjun Ke
  • Peng Wang
  • Ziyu Shang
  • Jinhua Gao
  • Guozheng Li
  • Ke Ji
  • Yanhe Liu

Traditional knowledge graph embedding (KGE) methods typically require preserving the entire knowledge graph (KG), with significant training costs when new knowledge emerges. To address this issue, the continual knowledge graph embedding (CKGE) task has been proposed to train the KGE model by learning emerging knowledge efficiently while preserving old knowledge effectively. However, the explicit graph structure in KGs, which is critical for this goal, has been heavily ignored by existing CKGE methods. On the one hand, existing methods usually learn new triples in a random order, destroying the inner structure of new KGs. On the other hand, old triples are preserved with equal priority, failing to alleviate catastrophic forgetting effectively. In this paper, we propose a competitive method for CKGE based on incremental distillation (IncDE), which makes full use of the explicit graph structure in KGs. First, to optimize the learning order, we introduce a hierarchical strategy, ranking new triples for layer-by-layer learning. By employing the inter- and intra-hierarchical orders together, new triples are grouped into layers based on graph structure features. Second, to preserve old knowledge effectively, we devise a novel incremental distillation mechanism, which facilitates the seamless transfer of entity representations from the previous layer to the next one, promoting old-knowledge preservation. Finally, we adopt a two-stage training paradigm to avoid the over-corruption of old knowledge by under-trained new knowledge. Experimental results demonstrate the superiority of IncDE over state-of-the-art baselines. Notably, the incremental distillation mechanism contributes improvements of 0.2%-6.5% in the mean reciprocal rank (MRR) score. Further exploratory experiments validate the effectiveness of IncDE in learning new knowledge proficiently while preserving old knowledge across all time steps.
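
The layer-by-layer ordering might be approximated by grouping new triples by their BFS distance from old-KG entities, as in the sketch below; the distillation between layers and the two-stage training are omitted, and the layering rule is an assumption.

```python
# Sketch of a layer-by-layer learning order: sort new triples by how
# structurally close they are to the old KG (illustrative; not IncDE).
from collections import deque

def layer_triples(new_triples, old_entities):
    # Build an undirected adjacency over the new triples.
    adj = {}
    for h, r, t in new_triples:
        adj.setdefault(h, []).append(t)
        adj.setdefault(t, []).append(h)
    # BFS distances from the old KG's entities.
    dist = {e: 0 for e in old_entities}
    frontier = deque(old_entities)
    while frontier:
        u = frontier.popleft()
        for v in adj.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                frontier.append(v)
    # A triple's layer = min distance of its endpoints to the old KG.
    return sorted(new_triples,
                  key=lambda tr: min(dist.get(tr[0], 99), dist.get(tr[2], 99)))

triples = [("c", "r1", "d"), ("a", "r2", "c")]
print(layer_triples(triples, old_entities={"a"}))
```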

AAAI 2024 · Conference Paper

Unify Named Entity Recognition Scenarios via Contrastive Real-Time Updating Prototype

  • Yanhe Liu
  • Peng Wang
  • Wenjun Ke
  • Guozheng Li
  • Xiye Chen
  • Jiteng Zhao
  • Ziyu Shang

Supervised named entity recognition (NER) aims to classify entity mentions into a fixed number of pre-defined types. However, in real-world scenarios, unknown entity types are continually involved. Naive fine-tuning will result in catastrophic forgetting of old entity types. Existing continual methods usually depend on knowledge distillation to alleviate forgetting, which is less effective on long task sequences. Moreover, most of them are specific to the class-incremental scenario and cannot adapt to the online scenario, which is more common in practice. In this paper, we propose a unified framework called Contrastive Real-time Updating Prototype (CRUP) that can handle different scenarios for NER. Specifically, we train a Gaussian projection model with a regularized contrastive objective. After training on each batch, we store the mean vectors of representations belonging to new entity types as their prototypes. Meanwhile, we update existing prototypes belonging to old types based only on representations from the current batch. The final prototypes are used for nearest-class-mean classification. In this way, CRUP can handle different scenarios through its batch-wise learning. Moreover, CRUP can alleviate forgetting in continual scenarios with only current data instead of old data. To comprehensively evaluate CRUP, we construct extensive benchmarks based on various datasets. Experimental results show that CRUP significantly outperforms baselines in continual scenarios and is also competitive in the supervised scenario.
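
The real-time prototype update is essentially a running per-type mean over batch representations, followed by nearest-class-mean classification; the sketch below shows that bookkeeping and omits the regularized contrastive training.

```python
# Minimal sketch of real-time prototype updating and nearest-class-mean
# classification (toy shapes; the contrastive objective is omitted).
import torch

prototypes, counts = {}, {}

def update_prototypes(reps, labels):
    """Running mean per entity type, updated from the current batch only."""
    for z, y in zip(reps, labels):
        n = counts.get(y, 0)
        prototypes[y] = (prototypes.get(y, torch.zeros_like(z)) * n + z) / (n + 1)
        counts[y] = n + 1

def classify(z):
    # Assign to the nearest prototype (nearest class mean).
    return min(prototypes, key=lambda y: torch.dist(z, prototypes[y]).item())

update_prototypes(torch.randn(4, 8), ["PER", "LOC", "PER", "ORG"])
print(classify(torch.randn(8)))
```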

NeurIPS 2024 · Conference Paper

Unveiling LoRA Intrinsic Ranks via Salience Analysis

  • Wenjun Ke
  • Jiahao Wang
  • Peng Wang
  • Jiajun Liu
  • Dong Nie
  • Guozheng Li
  • Yining Li

The immense parameter scale of large language models underscores the necessity for parameter-efficient fine-tuning methods. Methods based on Low-Rank Adaptation (LoRA) assume the low-rank characteristics of the incremental matrix and optimize the matrix obtained from low-rank decomposition. Although effective, these methods are constrained by a fixed and unalterable intrinsic rank, neglecting the variable importance of matrices. Consequently, methods for adaptive rank allocation have been proposed, among which AdaLoRA demonstrates excellent fine-tuning performance. AdaLoRA conducts adaptation based on singular value decomposition (SVD), dynamically allocating intrinsic ranks according to importance. However, it still struggles to balance fine-tuning effectiveness and efficiency, leading to limited rank allocation space. Additionally, its importance measurement focuses only on parameters with minimal impact on the loss, neglecting the dominant role of singular values in SVD-based matrices and the fluctuations during training. To address these issues, we propose SalientLoRA, which adaptively optimizes the intrinsic ranks of LoRA via salience measurement. First, during rank allocation, the salience measurement analyzes the variation of singular value magnitudes across multiple time steps and establishes their inter-dependency relationships to assess matrix importance. This measurement mitigates the instability and randomness that may arise during importance assessment. Second, to balance fine-tuning performance and efficiency, we propose an adaptive time-series window adjustment, which controls the window size used for salience measurement and rank reduction during training, allowing for rapid rank allocation while maintaining training stability. This mechanism enables matrices to start from a higher initial rank, thus expanding the allocation space for ranks. To evaluate the generality of our method across various tasks, we conduct experiments on natural language understanding (NLU), natural language generation (NLG), and large model instruction tuning tasks. Experimental results demonstrate the superiority of SalientLoRA, which outperforms state-of-the-art methods by 0.96%-3.56% on multiple datasets. Furthermore, as the rank allocation space expands, our method ensures fine-tuning efficiency, achieving a 94.5% speed improvement compared to AdaLoRA. The code is publicly available at https://github.com/Heyest/SalientLoRA.
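
A toy reading of salience from singular-value trajectories: score each rank by the magnitude of its singular value over a time window, penalized by its fluctuation, then keep the top ranks. The exact scoring is an assumption, not the released SalientLoRA code.

```python
# Sketch of salience scoring over a window of singular-value snapshots.
import torch

def salience(sv_history):
    """sv_history: (window, rank) singular values across training steps.
    Prefer singular values that are both large and stable."""
    magnitude = sv_history.mean(dim=0)
    instability = sv_history.std(dim=0)
    return magnitude / (1.0 + instability)

def prune_ranks(sv_history, keep):
    scores = salience(sv_history)
    return torch.topk(scores, keep).indices   # indices of ranks to retain

history = torch.rand(10, 16)                  # 10-step window, 16 ranks (toy)
print(prune_ranks(history, keep=8))
```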

AAAI 2023 · Conference Paper

fmLRE: A Low-Resource Relation Extraction Model Based on Feature Mapping Similarity Calculation

  • Peng Wang
  • Tong Shao
  • Ke Ji
  • Guozheng Li
  • Wenjun Ke

Low-resource relation extraction (LRE) aims to extract relations from limited labeled corpora. Existing work takes advantage of self-training or distant supervision to expand the limited labeled data in data-driven approaches, while the selection bias of pseudo labels may cause error accumulation in subsequent relation classification. To address this issue, this paper proposes fmLRE, an iterative feedback method based on feature mapping similarity calculation to improve the accuracy of pseudo labels. First, it calculates the similarities between pseudo-label and real-label data of the same category in a feature mapping space, based on the semantic features of the labeled dataset after feature projection. Then, it fine-tunes the initial model according to an iterative reinforcement learning process. Finally, the similarity is used as a threshold for screening high-precision pseudo labels and as the basis for setting different rewards; it also acts as a penalty term in the loss function of the relation classifier. Experimental results demonstrate that fmLRE achieves state-of-the-art performance compared with strong baselines on two public datasets.
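
The screening step can be sketched as cosine similarity in a projected feature space against same-class real-label features, with the similarity doubling as a reward; the threshold and features below are hypothetical.

```python
# Sketch of similarity-thresholded pseudo-label screening (a simplified
# reading of fmLRE's feature-mapping similarity).
import torch
import torch.nn.functional as F

def screen_pseudo_labels(pseudo_feats, class_feats, labels, tau=0.8):
    """Keep a pseudo-labeled sample only if it is close to real-label
    features of the same class; the similarity also serves as a reward."""
    kept, rewards = [], []
    for i, y in enumerate(labels):
        sim = F.cosine_similarity(pseudo_feats[i], class_feats[y], dim=0)
        if sim >= tau:
            kept.append(i)
            rewards.append(sim.item())
    return kept, rewards

class_feats = {0: torch.randn(32), 1: torch.randn(32)}  # real-label centroids
kept, rewards = screen_pseudo_labels(torch.randn(5, 32), class_feats,
                                     [0, 1, 0, 1, 0], tau=0.2)
```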

AAAI 2023 · Conference Paper

Online Noisy Continual Relation Learning

  • Guozheng Li
  • Peng Wang
  • Qiqing Luo
  • Yanhe Liu
  • Wenjun Ke

Recent work on continual relation learning has achieved remarkable progress. However, most existing methods focus only on tackling catastrophic forgetting to improve performance in the existing setup, while continual relation learning in the real world must overcome many other challenges. One is that data may arrive in an online streaming fashion, with data distributions gradually changing and without distinct task boundaries. Another is that noisy labels are inevitable in the real world, as relation samples may be contaminated by label inconsistencies or labeled with distant supervision. In this work, we therefore propose a novel continual relation learning framework that simultaneously addresses both the online and noisy relation learning challenges. Our framework contains three key modules: (i) a sample-separated online purifying module that divides the online data stream into clean and noisy samples, (ii) a self-supervised online learning module that circumvents inferior training signals caused by noisy data, and (iii) a semi-supervised offline fine-tuning module that ensures the participation of both clean and noisy samples. Experimental results on FewRel, TACRED, and NYT-H with real-world noise demonstrate that our framework greatly outperforms combinations of state-of-the-art online continual learning and noisy label learning methods.
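
The purifying module is in the spirit of small-loss separation: on each streaming batch, treat low-loss samples as clean and the rest as noisy. The sketch below shows only that heuristic, a stand-in for the paper's sample-separated purifier.

```python
# Minimal sketch of loss-based clean/noisy separation on a streaming batch
# (small-loss heuristic; the clean fraction is an assumed hyperparameter).
import torch

def separate(losses, clean_fraction=0.7):
    k = max(1, int(len(losses) * clean_fraction))
    order = torch.argsort(losses)    # small-loss samples first
    return order[:k], order[k:]      # (clean indices, noisy indices)

losses = torch.tensor([0.2, 2.5, 0.4, 3.1, 0.1])
clean, noisy = separate(losses)
```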

IJCAI 2023 · Conference Paper

Towards Incremental NER Data Augmentation via Syntactic-aware Insertion Transformer

  • Wenjun Ke
  • Zongkai Tian
  • Qi Liu
  • Peng Wang
  • Jinhua Gao
  • Rui Qi

Named entity recognition (NER) aims to locate and classify named entities in natural language texts. Most existing high-performance NER models employ a supervised paradigm, which requires a large quantity of high-quality annotated data during training. To help NER models perform well in few-shot scenarios, data augmentation approaches attempt to build extra data by means of random editing or end-to-end generation with PLMs. However, these methods focus only on the fluency of generated sentences, ignoring the syntactic correlation between the new and raw sentences. This lack of correlation also brings low diversity and inconsistent labeling of synthetic samples. To fill this gap, we present SAINT (Syntactic-Aware InsertioN Transformer), a hard-constraint controlled text generation model that incorporates syntactic information. The proposed method operates by inserting new tokens between existing entities in a parallel manner. During the insertion procedure, new tokens are added taking both semantic and syntactic factors into account. Hence the resulting sentences retain syntactic correctness with respect to the raw data. Experimental results on two benchmark datasets, i.e., OntoNotes and WikiAnn, demonstrate the comparable performance of SAINT to the state-of-the-art baselines.
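
Parallel insertion between entity spans can be skeletonized as below; the `propose` scorer is a hypothetical stand-in for SAINT's semantic-plus-syntactic token proposer.

```python
# Toy skeleton of parallel token insertion between entity spans
# (illustrative control flow only; not the SAINT model).
def insert_between_entities(tokens, entity_spans, propose):
    out, last = [], 0
    for start, end in entity_spans:               # spans are (start, end) indices
        out.extend(tokens[last:start])
        out.extend(propose(tokens, last, start))  # new tokens before the entity
        out.extend(tokens[start:end])
        last = end
    out.extend(tokens[last:])
    return out

# Stand-in proposer: insert a filler token when there is room between spans.
propose = lambda toks, i, j: ["reportedly"] if j - i > 0 else []
print(insert_between_entities(["Obama", "visited", "Paris"],
                              [(0, 1), (2, 3)], propose))
```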