EAAI Journal 2026 Journal Article
Deep multi-modal fusion transformer for emotion recognition
- Qian Zhang
- Yifan Liu
- Biaokai Zhu
- Xun Han
- Ruilin Zhang
- Jun Xiao
- Zhe Wang
Multi-modal emotion recognition methods usually integrate peripheral and physiological information to extract complementary features. However, current multi-modal methods fall short in modeling spatio-temporal dependencies and in extracting complementary features across modalities, which limits the accuracy and robustness of emotion recognition. To address these issues, this paper proposes a multi-modal emotion recognition network based on cross-modal transformer fusion to enhance the collaborative processing of electroencephalogram (EEG) and facial expression data. Specifically, we construct a combined fine and coarse transformer encoder for multi-level spatio-temporal feature extraction from EEG signals, and introduce a multi-modal cross-attention fusion transformer to achieve deep fusion of EEG features, facial expression features, and joint features, capturing dynamic relationships between and within modalities. Experimental results on two public datasets show that this method outperforms existing methods in accuracy and robustness, achieving deep fusion of the dynamic and complex characteristics of emotion.
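To illustrate the general idea of cross-attention between modalities, the following is a minimal NumPy sketch and not the authors' architecture, which the abstract does not specify. It assumes single-head scaled dot-product attention, random projection weights, and mean pooling followed by concatenation to form a joint vector; the names `cross_attention`, `make_projections`, `eeg_tokens`, and `face_tokens`, as well as all dimensions, are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention.

    Tokens of one modality (queries) attend over tokens of the
    other modality (context). Returns attended features and the
    attention weights.
    """
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v, weights


def make_projections(rng, d_model):
    """Hypothetical random Q/K/V projections (untrained)."""
    return tuple(
        rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
        for _ in range(3)
    )


rng = np.random.default_rng(0)
d_model = 16  # assumed shared embedding size for both modalities

# Assumed inputs: 10 EEG tokens and 6 facial-expression tokens.
eeg_tokens = rng.standard_normal((10, d_model))
face_tokens = rng.standard_normal((6, d_model))

# Separate projections for each attention direction.
eeg_to_face = make_projections(rng, d_model)
face_to_eeg = make_projections(rng, d_model)

# EEG tokens query the facial tokens, and vice versa.
eeg_attended, eeg_weights = cross_attention(eeg_tokens, face_tokens, *eeg_to_face)
face_attended, face_weights = cross_attention(face_tokens, eeg_tokens, *face_to_eeg)

# Assumed fusion step: mean-pool each stream and concatenate.
joint = np.concatenate([eeg_attended.mean(axis=0), face_attended.mean(axis=0)])
```

In a trained model the projections would be learned and typically wrapped in multi-head attention with residual connections and normalization; the sketch only shows how each modality can attend over the other before the two streams are combined.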