JBHI 2026 Journal Article
A 1D Snoring Waveform and 2D Composite Acoustic Feature Graph-Based Multi-Modal Fusion Network for Obstructive Sites Recognition
- Xia Hu
- Rui Fang
- Huiping Luo
- Jingchun Luo
- Chen Chen
- Wei Chen
Accurate localization of obstructive sites in the upper airway is urgently needed, as it is a critical factor in the diagnostic work-up and treatment decision-making for sleep-related breathing disorders. Snoring, as a dynamic acoustic signal, carries rich information about the sites and degree of obstruction in the upper airway, offering a non-invasive, cost-effective solution for obstructive site recognition. However, most existing snoring-based methods for recognizing obstructive sites exploit only limited information (concentrating mainly on either traditional acoustic characteristics or spectrogram features) and may therefore omit dynamic pathological information. Moreover, existing methods proceed from either a one-dimensional (1D) signal or a two-dimensional (2D) image perspective, overlooking complementary information from the other modality. In this paper, a multi-modal framework that combines the 1D snoring waveform with a 2D Composite Acoustic Feature Graph (CAF-Graph) is proposed. The 1D snoring waveform perceives fine time structure and local patterns, from which neural networks learn high-level discriminative representations. The 2D CAF-Graph emphasizes the dynamic spatio-temporal and physiological-acoustic characteristics of snoring by concatenating acoustic features related to Prosodic, Formant, Spectral, and Cepstral characteristics. Further, a multi-modal fusion network (BMFNet) effectively integrates independent and interactive information between single-modal features, offering a more comprehensive perspective. The recognition task was formulated as a three-class classification problem: upper (snoring caused by upper-level obstruction), lower (snoring caused by lower-level obstruction), and silence (obstruction without snoring). The proposed method was validated on a clinical dataset collected at the ENT Institute and Department of Otorhinolaryngology, Eye & ENT Hospital, Fudan University, where it reached 81.2% Accuracy, 86.8% Weighted Average Precision, 81.2% Weighted Average Recall, and 82.3% Weighted Average F1-Score. The results demonstrate the effectiveness of multi-modal feature representations for snoring, providing novel insight for obstructive site recognition tasks.
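The abstract describes assembling a 2D composite feature representation by concatenating per-frame acoustic descriptors over time. The sketch below illustrates that general idea only: the frame sizes, feature choices, and function names are assumptions for illustration (simple stand-ins for prosodic, formant, spectral, and cepstral features), not the paper's actual CAF-Graph definition.

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Split a 1D snoring waveform into overlapping frames (sizes are assumed)."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def composite_feature_graph(x, frame_len=400, hop=160):
    """Toy feature-by-time matrix built by concatenating simple stand-ins for
    prosodic, formant, spectral, and cepstral descriptors per frame.
    The real CAF-Graph uses richer, paper-specific feature sets."""
    cols = []
    for f in frame_signal(x, frame_len, hop):
        win = f * np.hanning(len(f))
        mag = np.abs(np.fft.rfft(win)) + 1e-8          # magnitude spectrum
        energy = np.array([np.log(np.sum(win ** 2) + 1e-8)])   # prosodic stand-in
        centroid = np.array([np.sum(np.arange(len(mag)) * mag) / np.sum(mag)])  # spectral stand-in
        edges = np.linspace(0, len(mag), 9)[:-1].astype(int)
        bands = np.log(np.add.reduceat(mag, edges))    # coarse band energies as a formant proxy
        cep = np.fft.irfft(np.log(mag))[:8]            # low-order cepstrum stand-in
        cols.append(np.concatenate([energy, centroid, bands, cep]))
    return np.stack(cols, axis=1)  # shape: (num_features, num_frames)

# Usage on a synthetic waveform: 1600 samples -> 8 frames, 18 features each.
rng = np.random.default_rng(0)
graph = composite_feature_graph(rng.standard_normal(1600))
print(graph.shape)  # (18, 8)
```

Such a matrix can then be treated as a 2D "image" input to a convolutional branch, complementing a 1D branch that consumes the raw waveform directly.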