EAAI Journal 2026 Journal Article
A trinity-branch parallel fusion and supervised enhancement network: A multimodal celiac disease diagnosis network based on transformer and dual-tower supervision
- Jiahe Li
- Tian Shi
- Chen Chen
- Xuguang Zhou
- Wei Liu
- Xiaoyi Lv
- Feng Gao
- Cheng Chen
Celiac disease (CD) is a complex autoimmune disorder for which accurate diagnosis is crucial to improving patients' quality of life. Spectroscopic analysis, with its high sensitivity and non-invasive nature, can reveal subtle molecular-level changes in samples, providing an objective and reliable basis for diagnosing complex diseases like CD. Integrating multi-source omics data from techniques such as Raman spectroscopy, infrared spectroscopy, and metabolomics promises a comprehensive view for disease diagnosis. The key challenge, however, lies in how to fuse these multi-modal data effectively so that their complementary information is fully exploited. To address this challenge, we introduce a novel multi-modal deep fusion framework called Trinity Branch Parallel Fusion and Supervised Enhancement Net (TFS-Net). This framework employs an efficient multi-stage architecture to systematically process and fuse three types of omics data. First, dedicated modality-specific feature encoders, such as multi-scale dynamic convolutions for spectroscopic data and a self-attention-enhanced MLP for metabolomics data, are used for efficient intra-modal feature encoding. Next, a cross-modal attention mechanism explores the pairwise interactions between modalities. Building on this, our study introduces a dual-tower similarity supervision auxiliary task to enhance the consistency of feature representations across different modalities. Finally, a Transformer encoder performs global contextual modeling on all features to output the final diagnostic prediction. On a celiac disease dataset, the TFS-Net model demonstrates superior diagnostic performance. It achieves 95.82% accuracy under five-fold cross-validation, significantly outperforming existing single-modal baseline models and state-of-the-art multi-modal fusion methods.
Furthermore, systematic ablation studies validate the necessity and effectiveness of our proposed multi-modal strategy and key model components, including dynamic convolution, cross-modal attention, and dual-tower supervision.
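
As an illustration only, the pairwise cross-modal attention and the dual-tower similarity supervision described in the abstract can be sketched in NumPy as follows. The abstract does not give the actual formulation, so everything here is an assumption: the function names, the feature shapes, the use of plain scaled dot-product attention without learned projections, the choice of Raman and infrared features as the two towers, and the "1 minus cosine similarity" form of the auxiliary loss. Random arrays stand in for the outputs of the modality-specific encoders.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_modal_attention(query_feats, context_feats):
    # Scaled dot-product attention in which tokens of one modality (queries)
    # attend to tokens of another modality (keys and values).
    # Assumption: no learned projection matrices, to keep the sketch short.
    dim = query_feats.shape[-1]
    scores = query_feats @ context_feats.T / np.sqrt(dim)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ context_feats


def cosine_similarity(a, b, eps=1e-8):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


def dual_tower_similarity_loss(tower_a, tower_b):
    # Hypothetical auxiliary loss: mean-pool each tower's tokens and penalise
    # low cosine similarity between the two pooled embeddings.
    return 1.0 - cosine_similarity(tower_a.mean(axis=0), tower_b.mean(axis=0))


# Stand-ins for encoder outputs: 4 tokens of dimension 8 per modality.
rng = np.random.default_rng(0)
raman = rng.normal(size=(4, 8))
infrared = rng.normal(size=(4, 8))
metabolomics = rng.normal(size=(4, 8))

# Pairwise interactions between the three modalities.
raman_from_infrared = cross_modal_attention(raman, infrared)
raman_from_metabolomics = cross_modal_attention(raman, metabolomics)
infrared_from_metabolomics = cross_modal_attention(infrared, metabolomics)

# Auxiliary consistency term between two towers (assumed pairing).
aux_loss = dual_tower_similarity_loss(raman, infrared)
```

In a full model, the attended features would be concatenated and passed to a Transformer encoder for the final prediction, and the auxiliary term would be added to the classification loss with a weighting factor; neither step is shown here because the abstract does not specify them.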