AAAI 2026 Conference Paper
Deconstructing Pre-training: Knowledge Attribution Analysis in MoE and Dense Models
- Bo Wang
- Junzhuo Li
- Hong Chen
- Yuanlin Chu
- Yuxuan Fan
- Xuming Hu
Mixture-of-Experts (MoE) architectures decouple model capacity from per-token computation, enabling scaling beyond the computational limits imposed by dense scaling laws. Yet how MoE architectures shape knowledge acquisition during pre-training, and how this process differs from dense architectures, remains poorly understood. To address this question, we introduce Gated-LPI (Log-Probability Increase), a neuron-level attribution metric that decomposes the increase in the log-probability of a target prediction across individual neurons. We present a time-resolved comparison of knowledge acquisition dynamics in MoE and dense architectures, tracking checkpoints over 1.2M training steps (~5.0T tokens) and 600K training steps (~2.5T tokens), respectively. Our experiments uncover three patterns. (1) Low-entropy backbone: the top ~1% of MoE neurons capture over 45% of positive updates, forming a high-utility core that is absent in the dense baseline. (2) Early consolidation: the MoE model locks into a stable neuron-importance profile within roughly half the training required by the dense model. (3) Distributed storage: sparsity fosters distributed, rather than brittle, knowledge storage. These patterns collectively demonstrate that sparsity fosters an intrinsically stable and distributed computational backbone from early in training, helping bridge the gap between sparse architectures and training-time interpretability.
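The abstract describes Gated-LPI only at a high level. As a rough illustration of the kind of neuron-level attribution it refers to, the sketch below computes, for a toy single-token MoE block, the change in target-token log-probability when each expert neuron is zero-ablated, weighted by that expert's routing gate. The toy model, the zero-ablation definition of the log-probability increase, and the gate weighting are assumptions made for illustration; the paper's actual Gated-LPI formulation may differ (e.g., it may be computed analytically rather than by ablation).

```python
# Minimal sketch of a "gated log-probability increase" style attribution on a
# toy, single-token MoE block. The toy model, the zero-ablation definition of
# LPI, and the gate weighting are illustrative assumptions, not the paper's
# exact Gated-LPI formulation.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
D, H, E, V, TOP_K = 16, 32, 4, 50, 2  # model dim, expert width, experts, vocab, top-k routing

# Toy parameters: a router, E expert FFNs, and an unembedding matrix.
router = torch.randn(E, D)
w_in = torch.randn(E, H, D) / D ** 0.5   # expert up-projections
w_out = torch.randn(E, D, H) / H ** 0.5  # expert down-projections
unembed = torch.randn(V, D) / D ** 0.5

def moe_logprob(x, target, ablate=None):
    """Run one MoE FFN block on a single token vector x and return the
    log-probability of `target`. `ablate=(expert, neuron)` zeroes one neuron."""
    gate_logits = router @ x
    topk = torch.topk(gate_logits, TOP_K)
    gates = F.softmax(topk.values, dim=-1)
    out = x.clone()                          # residual stream
    for gate, e in zip(gates, topk.indices):
        h = F.relu(w_in[e] @ x)              # expert hidden activations
        if ablate is not None and ablate[0] == e.item():
            h = h.clone()
            h[ablate[1]] = 0.0               # zero-ablate one neuron
        out = out + gate * (w_out[e] @ h)
    logits = unembed @ out
    return F.log_softmax(logits, dim=-1)[target], topk.indices, gates

x = torch.randn(D)
target = 7                                   # hypothetical target token id

base_lp, experts, gates = moe_logprob(x, target)
scores = {}
for gate, e in zip(gates, experts):
    for n in range(H):
        ablated_lp, _, _ = moe_logprob(x, target, ablate=(e.item(), n))
        # Gated-LPI sketch: log-prob increase contributed by the neuron,
        # weighted by the routing gate of its expert.
        scores[(e.item(), n)] = (gate * (base_lp - ablated_lp)).item()

top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:5]
print("base log-prob:", base_lp.item())
print("top attributed (expert, neuron) pairs:", top)
```

Ranking neurons by such a score is what would allow statements like "the top ~1% of neurons capture over 45% of positive updates": one accumulates per-neuron scores across checkpoints and examines how concentrated the resulting distribution is.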