Arrow Research search

Author name cluster

Xu Wu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

5 papers
1 author row

Possible papers (5)

AAAI 2026 · Conference Paper

FaNe: Towards Fine-Grained Cross-Modal Contrast with False-Negative Reduction and Text-Conditioned Sparse Attention

  • Peng Zhang
  • Zhihui Lai
  • Wenting Chen
  • Xu Wu
  • Heng Kong

Medical vision-language pre-training (VLP) offers significant potential for advancing medical image understanding by leveraging paired image-report data. However, existing methods are limited by False Negatives (FaNe) induced by semantically similar texts and insufficient fine-grained cross-modal alignment. To address these limitations, we propose FaNe, a semantic-enhanced VLP framework. To mitigate false negatives, we introduce a semantic-aware positive pair mining strategy based on text-text similarity with adaptive normalization. Furthermore, we design a text-conditioned sparse attention pooling module to enable fine-grained image-text alignment through localized visual representations guided by textual cues. To strengthen intra-modal discrimination, we develop a hard-negative aware contrastive loss that adaptively reweights semantically similar negatives. Extensive experiments on five downstream medical imaging benchmarks demonstrate that FaNe achieves state-of-the-art performance across image classification, object detection, and semantic segmentation, validating the effectiveness of our framework.
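
The combination of positive-pair mining and hard-negative reweighting described in the abstract can be sketched as a contrastive-loss variant. The snippet below is an illustrative assumption of how such a loss might look, not the authors' implementation; the function name `fane_style_loss`, the threshold `pos_thresh`, and the scaling factor `hard_scale` are hypothetical choices.

```python
# Minimal sketch, assuming simple forms for two ideas in the abstract:
#  - positive-pair mining: text pairs whose text-text similarity exceeds pos_thresh
#    are treated as extra positives rather than negatives (false-negative reduction);
#  - hard-negative awareness: the remaining negatives are up-weighted in proportion
#    to their similarity to the anchor text.
# Names, thresholds, and the weighting scheme are illustrative, not the paper's.
import torch
import torch.nn.functional as F

def fane_style_loss(img_emb, txt_emb, temperature=0.07, pos_thresh=0.9, hard_scale=2.0):
    img_emb = F.normalize(img_emb, dim=-1)            # (B, D)
    txt_emb = F.normalize(txt_emb, dim=-1)            # (B, D)
    logits = img_emb @ txt_emb.t() / temperature      # image-to-text similarities (B, B)

    # Semantic-aware positive mining: high text-text similarity flags likely false negatives.
    txt_sim = txt_emb @ txt_emb.t()                   # (B, B), diagonal is 1
    pos_mask = (txt_sim >= pos_thresh).float()
    neg_mask = 1.0 - pos_mask

    # Hard-negative weights: semantically similar negatives count more in the denominator.
    neg_weight = 1.0 + hard_scale * txt_sim.clamp(min=0.0)
    weights = pos_mask + neg_weight * neg_mask

    # Soft-label cross-entropy over each image row, averaged over its positives.
    log_denom = torch.log((torch.exp(logits) * weights).sum(dim=1, keepdim=True))
    log_prob = logits - log_denom
    loss = -(pos_mask * log_prob).sum(dim=1) / pos_mask.sum(dim=1).clamp(min=1.0)
    return loss.mean()
```

In this sketch the same text-text similarity matrix drives both decisions: above the threshold it rescues a likely false negative, below it it sharpens the penalty on a hard negative. The paper's adaptive normalization and text-conditioned sparse attention pooling are not modeled here.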

NeurIPS 2025 · Conference Paper

Seeing Sound, Hearing Sight: Uncovering Modality Bias and Conflict of AI models in Sound Localization

  • Yanhao Jia
  • Ji Xie
  • S Jivaganesh
  • Li Hao
  • Xu Wu
  • Mengmi Zhang

Imagine hearing a dog bark and instinctively turning toward the sound, only to find a parked car while a silent dog sits nearby. Such moments of sensory conflict challenge perception, yet humans flexibly resolve these discrepancies, prioritizing auditory cues over misleading visuals to accurately localize sounds. Despite the rapid advancement of multimodal AI models that integrate vision and sound, little is known about how these systems handle cross-modal conflicts or whether they favor one modality over another. Here, we systematically and quantitatively examine modality bias and conflict resolution in AI models for Sound Source Localization (SSL). We evaluate a wide range of state-of-the-art multimodal models and compare them against human performance in psychophysics experiments spanning six audiovisual conditions, including congruent, conflicting, and absent visual and audio cues. Our results reveal that humans consistently outperform AI in SSL and exhibit greater robustness to conflicting or absent visual information by effectively prioritizing auditory signals. In contrast, AI shows a pronounced bias toward vision, often failing to suppress irrelevant or conflicting visual input, leading to chance-level performance. To bridge this gap, we present EchoPin, a neuroscience-inspired multimodal model for SSL that emulates human auditory perception. The model is trained on our carefully curated AudioCOCO dataset, in which stereo audio signals are first rendered using a physics-based 3D simulator, then filtered with Head-Related Transfer Functions (HRTFs) to capture pinnae, head, and torso effects, and finally transformed into cochleagram representations that mimic cochlear processing. To eliminate existing biases in standard benchmark datasets, we carefully controlled the vocal object sizes, semantics, and spatial locations in the corresponding images of AudioCOCO. EchoPin outperforms existing models trained on standard audio-visual datasets. Remarkably, consistent with neuroscience findings, it exhibits a human-like localization bias, favoring horizontal (left–right) precision over vertical (up–down) precision. This asymmetry likely arises from HRTF-shaped and cochlear-modulated stereo audio and the lateral placement of human ears, highlighting how sensory input quality and physical structure jointly shape the precision of multimodal representations. All code, data, and models are available at https://github.com/CuriseJia/SSHS.
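
As a rough illustration of the data pipeline the abstract describes (HRTF-filtered stereo rendering followed by a cochleagram-like representation), here is a minimal NumPy/SciPy sketch. It is an assumption about the pipeline's overall shape only: the helper names, the log-spaced Butterworth filterbank, and the Hilbert-envelope step stand in for the paper's physics-based simulator, measured HRTFs, and cochlear model.

```python
# Minimal sketch, assuming a simplified pipeline: convolve a mono source with
# left/right head-related impulse responses (HRIRs) to obtain stereo audio, then
# build a cochleagram-like representation from a log-spaced band-pass filterbank
# with Hilbert-envelope extraction. The paper's physics-based 3D simulator,
# measured HRTFs, and cochlear model are richer than what is shown here.
import numpy as np
from scipy.signal import fftconvolve, butter, sosfiltfilt, hilbert

def render_binaural(mono, hrir_left, hrir_right):
    """Simulate left/right ear signals for one source direction."""
    left = fftconvolve(mono, hrir_left)[: len(mono)]
    right = fftconvolve(mono, hrir_right)[: len(mono)]
    return np.stack([left, right])                        # (2, T)

def cochleagram(stereo, fs=16000, n_bands=32, fmin=80.0, fmax=7800.0, hop=160):
    """Log-spaced band-pass filterbank plus envelope, computed per ear."""
    centers = np.geomspace(fmin, fmax, n_bands)
    bands = []
    for fc in centers:
        lo, hi = fc / 1.2, min(fc * 1.2, fs / 2 - 1.0)
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        filtered = sosfiltfilt(sos, stereo, axis=-1)       # (2, T)
        envelope = np.abs(hilbert(filtered, axis=-1))      # cochlear-like envelope
        bands.append(envelope[..., ::hop])                 # crude temporal downsampling
    return np.stack(bands, axis=1)                         # (2 ears, n_bands, frames)
```

The stereo rendering step is what carries the left-right cues the abstract attributes to ear placement; the filterbank-plus-envelope step is a common stand-in for cochlear processing, not the model's actual front end.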

JBHI 2024 · Journal Article

Spatio-Temporal Classification of Lung Ventilation Patterns Using 3D EIT Images: A General Approach for Individualized Lung Function Evaluation

  • Shuzhe Chen
  • Li Li
  • Zhichao Lin
  • Ke Zhang
  • Ying Gong
  • Lu Wang
  • Xu Wu
  • Maokun Li

The Pulmonary Function Test (PFT) is a widely utilized and rigorous classification test for evaluating lung function, serving as a comprehensive diagnostic tool for lung conditions. Meanwhile, Electrical Impedance Tomography (EIT) is a rapidly advancing clinical technique that visualizes the conductivity distribution induced by ventilation. EIT provides additional spatial and temporal information on lung ventilation beyond traditional PFT. However, relying solely on conventional isolated interpretations of PFT results and EIT images overlooks the continuous dynamic aspects of lung ventilation. This study aims to classify lung ventilation patterns by extracting spatial and temporal features from the 3D EIT image series. The study uses a Variational Autoencoder (VAE) with a MultiRes block to compress the spatial distribution in a 3D image into a one-dimensional vector. These vectors are then stacked to create a feature map that represents the temporal features. A simple convolutional neural network is used for classification. Data from 137 subjects were utilized for the training phase. Initially, the model underwent validation through a leave-one-out cross-validation process. During this validation, the model achieved an accuracy and sensitivity of 0.96 and 1.00, respectively, with an F1-score of 0.98 when identifying the normal subjects. To assess pipeline reliability and feasibility, we tested it on 9 newly recruited subjects, with accurate ventilation mode predictions for 8 out of 9. In addition, we included 2D EIT results for comparison and conducted ablation experiments to validate the effectiveness of the VAE. The study demonstrates the potential of using image series for lung ventilation mode classification, providing a feasible method for patient prescreening and presenting an alternative form of PFT.
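
The described pipeline (per-frame encoding of 3D EIT volumes, stacking latents into a temporal feature map, then a small CNN classifier) could be prototyped along the following lines. This is a minimal PyTorch sketch under assumed layer sizes; it does not reproduce the paper's MultiRes blocks, the VAE's sampling and reconstruction terms, or its exact class set.

```python
# Minimal sketch, assuming a simplified stand-in for the paper's architecture:
# a small 3D-conv encoder compresses each EIT volume in the series to a latent
# vector, the vectors are stacked over time into a 2D feature map, and a compact
# 2D CNN classifies the ventilation pattern. All layer sizes are illustrative.
import torch
import torch.nn as nn

class FrameEncoder(nn.Module):
    """Encodes one 3D EIT volume (1, D, H, W) to a latent vector (VAE mean only)."""
    def __init__(self, latent_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(1, 8, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(8, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.mu = nn.Linear(16, latent_dim)  # the sampling/KL parts of a VAE are omitted

    def forward(self, vol):                   # vol: (B, 1, D, H, W)
        h = self.conv(vol).flatten(1)
        return self.mu(h)                     # (B, latent_dim)

class VentilationClassifier(nn.Module):
    """Stacks per-frame latents into a (time x latent) map and runs a small 2D CNN."""
    def __init__(self, latent_dim=64, n_classes=2):
        super().__init__()
        self.encoder = FrameEncoder(latent_dim)
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(8, n_classes)

    def forward(self, series):                # series: (B, T, 1, D, H, W)
        b, t = series.shape[:2]
        z = self.encoder(series.flatten(0, 1))  # encode every frame: (B*T, latent_dim)
        fmap = z.view(b, 1, t, -1)              # temporal feature map: (B, 1, T, latent_dim)
        return self.head(self.cnn(fmap).flatten(1))
```

The point of the stacking step is that spatial structure is summarized once per frame, so the classifier only has to read a small time-by-latent image rather than the full 4D series.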