Arrow Research search

Author name cluster

Cong Xu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

8 papers
1 author row

Possible papers (8)

AAAI 2026 Conference Paper

Conditional Information Bottleneck for Multimodal Fusion: Overcoming Shortcut Learning in Sarcasm Detection

  • Yihua Wang
  • Qi Jia
  • Cong Xu
  • Feiyu Chen
  • Yuhan Liu
  • Haotian Zhang
  • Liang Jin
  • Lu Liu

Multimodal sarcasm detection is a complex task that requires distinguishing subtle complementary signals across modalities while filtering out irrelevant information. Many advanced methods learn shortcuts from datasets rather than extracting the intended sarcasm-related features, and our experiments show that this shortcut learning impairs the model's generalization in real-world scenarios. Furthermore, we reveal the weaknesses of current modality fusion strategies for multimodal sarcasm detection through systematic experiments, highlighting the necessity of effective modality fusion for complex emotion recognition. To address these challenges, we construct MUStARD++R by removing shortcut signals from MUStARD++. We then introduce a Multimodal Conditional Information Bottleneck (MCIB) model to enable efficient multimodal fusion for sarcasm detection. Experimental results show that MCIB achieves the best performance without relying on shortcut learning.
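
To make the bottleneck idea concrete, below is a minimal variational information-bottleneck head in PyTorch. It is a generic sketch, not the paper's conditional, multimodal formulation: the module names, the standard-normal prior, and the weight `beta` are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VIBHead(nn.Module):
    """Generic variational information bottleneck over a fused feature h."""

    def __init__(self, in_dim: int, z_dim: int, n_classes: int):
        super().__init__()
        self.mu = nn.Linear(in_dim, z_dim)
        self.logvar = nn.Linear(in_dim, z_dim)
        self.cls = nn.Linear(z_dim, n_classes)

    def forward(self, h: torch.Tensor, y: torch.Tensor,
                beta: float = 1e-3) -> torch.Tensor:
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: z ~ N(mu, sigma^2)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        # KL(q(z|x) || N(0, I)) compresses the representation ...
        kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(-1).mean()
        # ... while the task loss keeps it predictive of the label.
        ce = F.cross_entropy(self.cls(z), y)
        return ce + beta * kl
```

The KL term squeezes out information not needed for the label, which is the mechanism a bottleneck uses to suppress shortcut signals; the conditional, multimodal variant in the paper is more involved.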

AAAI 2025 Conference Paper

STAIR: Manipulating Collaborative and Multimodal Information for E-Commerce Recommendation

  • Cong Xu
  • Yunhang He
  • Jun Wang
  • Wei Zhang

While the mining of modalities is the focus of most multimodal recommendation methods, we argue that fully utilizing both collaborative and multimodal information is pivotal in e-commerce scenarios where, as clarified in this work, user behaviors are rarely determined entirely by multimodal features. Combining the two distinct types of information raises additional challenges: 1) Modality erasure: vanilla graph convolution, though rather useful in collaborative filtering, erases multimodal information; 2) Modality forgetting: multimodal information tends to be gradually forgotten, as the recommendation loss essentially facilitates the learning of collaborative information. To this end, we propose a novel approach named STAIR, which employs a novel stepwise graph convolution to enable the co-existence of collaborative and multimodal information in e-commerce recommendation. In addition, it starts from the raw multimodal features as an initialization, so the forgetting problem can be significantly alleviated through constrained embedding updates. As a result, STAIR achieves state-of-the-art recommendation performance on three public e-commerce datasets with minimal computational and memory costs.
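
One ingredient named in the abstract, constrained embedding updates around a multimodal initialization, can be illustrated with a toy PyTorch step. The projection-radius rule below is an assumption made for illustration, not STAIR's actual constraint.

```python
import torch

def constrained_step(emb: torch.Tensor, grad: torch.Tensor,
                     init: torch.Tensor, lr: float = 0.01,
                     radius: float = 1.0) -> torch.Tensor:
    """SGD step, then projection into a ball around the multimodal init.

    emb, grad, init: (n_items, dim); `init` holds the raw multimodal features.
    """
    new = emb - lr * grad                              # plain gradient step
    drift = new - init                                 # how far we have moved
    norm = drift.norm(dim=-1, keepdim=True).clamp(min=1e-12)
    scale = torch.clamp(radius / norm, max=1.0)        # shrink only if outside
    return init + drift * scale                        # bounded drift per item
```

Bounding the drift keeps the embeddings anchored to the multimodal features, which is one way to mitigate the "modality forgetting" the abstract describes.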

NeurIPS 2024 Conference Paper

EEGPT: Pretrained Transformer for Universal and Reliable Representation of EEG Signals

  • Guangyu Wang
  • Wenchao Liu
  • Yuhong He
  • Cong Xu
  • Lin Ma
  • Haifeng Li

Electroencephalography (EEG) is crucial for recording brain activity, with applications in medicine, neuroscience, and brain-computer interfaces (BCI). However, challenges such as low signal-to-noise ratio (SNR), high inter-subject variability, and channel mismatch complicate the extraction of robust, universal EEG representations. We propose EEGPT, a novel 10-million-parameter pretrained transformer model designed for universal EEG feature extraction. In EEGPT, a mask-based dual self-supervised learning method for efficient feature extraction is designed. Compared to other mask-based self-supervised learning methods, EEGPT introduces spatio-temporal representation alignment. This involves constructing a self-supervised task based on EEG representations that possess high SNR and rich semantic information, rather than on raw signals. Consequently, this approach mitigates the issue of poor feature quality typically extracted from low-SNR signals. Additionally, EEGPT's hierarchical structure processes spatial and temporal information separately, reducing computational complexity while increasing flexibility and adaptability for BCI applications. By training on a large mixed multi-task EEG dataset, we fully exploit EEGPT's capabilities. Experiments validate the efficacy and scalability of EEGPT, which achieves state-of-the-art performance on a range of downstream tasks with linear probing. Our research advances EEG representation learning, offering innovative solutions for bio-signal processing and AI applications. The code for this paper is available at https://github.com/BINE022/EEGPT
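
A bare-bones sketch of the representation-alignment idea: a student predicts the representations that a (typically momentum-updated) teacher encoder assigns to masked patches, so the target lives in feature space rather than in the low-SNR raw signal. All shapes and names below are assumptions for illustration, not EEGPT's actual modules.

```python
import torch
import torch.nn.functional as F

def alignment_loss(student_pred: torch.Tensor,
                   teacher_repr: torch.Tensor,
                   mask: torch.Tensor) -> torch.Tensor:
    """MSE between predicted and target representations on masked patches.

    student_pred, teacher_repr: (batch, n_patches, dim)
    mask: (batch, n_patches) boolean, True where the patch was masked.
    """
    target = teacher_repr.detach()               # no gradient through teacher
    diff = F.mse_loss(student_pred, target, reduction="none").mean(-1)
    return (diff * mask).sum() / mask.sum().clamp(min=1)
```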

NeurIPS 2024 Conference Paper

Graph-enhanced Optimizers for Structure-aware Recommendation Embedding Evolution

  • Cong Xu
  • Jun Wang
  • Jianyong Wang
  • Wei Zhang

Embeddings play a key role in modern recommender systems because they are virtual representations of real-world entities and the foundation for subsequent decision-making models. In this paper, we propose a novel embedding update mechanism, Structure-aware Embedding Evolution (SEvo for short), to encourage related nodes to evolve similarly at each step. Unlike a GNN (Graph Neural Network), which typically serves as an intermediate module, SEvo is able to directly inject graph structural information into embeddings with minimal computational overhead during training. The convergence properties of SEvo, along with those of its potential variants, are theoretically analyzed to justify the validity of the designs. Moreover, SEvo can be seamlessly integrated into existing optimizers for state-of-the-art performance. In particular, SEvo-enhanced AdamW with moment estimate correction demonstrates consistent improvements across a spectrum of models and datasets, suggesting a novel technical route for effectively utilizing graph structural information beyond explicit GNN modules.
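
The core idea, smoothing each embedding update over a graph so related nodes evolve similarly, can be sketched as follows. The truncated fixed-point smoothing and the mixing weight `beta` are illustrative assumptions rather than the paper's exact formulation.

```python
import torch

def smooth_update(delta: torch.Tensor, adj_norm: torch.Tensor,
                  beta: float = 0.5, n_iters: int = 3) -> torch.Tensor:
    """Approximate (1 - beta) * (I - beta * A)^{-1} @ delta by iteration.

    delta: (n_nodes, dim) raw update the optimizer would apply.
    adj_norm: (n_nodes, n_nodes) normalized adjacency (dense here for brevity;
              a sparse matrix would be used in practice).
    """
    out = delta
    for _ in range(n_iters):
        # Each pass mixes a node's update with its neighbors' updates.
        out = beta * adj_norm @ out + (1.0 - beta) * delta
    return out
```

In use, such a transform would replace the raw optimizer update for the embedding table only, leaving all other parameters to update as usual; that matches the abstract's point that SEvo plugs into existing optimizers rather than adding a GNN module.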

NeurIPS 2024 Conference Paper

Infer Induced Sentiment of Comment Response to Video: A New Task, Dataset and Baseline

  • Qi Jia
  • Baoyu Fan
  • Cong Xu
  • Lu Liu
  • Liang Jin
  • Guoguang Du
  • Zhenhua Guo
  • Yaqian Zhao

Existing video multimodal sentiment analysis mainly focuses on the sentiment expressed by people within a video, yet often neglects the sentiment induced in viewers while watching it. Viewers' induced sentiment is essential for inferring the public response to videos and has broad applications in analyzing public societal sentiment, the effectiveness of advertising, and other areas. Micro videos and their related comments provide a rich application scenario for analyzing viewers' induced sentiment. In light of this, we introduce a novel research task, Multimodal Sentiment Analysis for Comment Response of Video Induced (MSA-CRVI), which aims to infer opinions and emotions from comments responding to micro videos. We also manually annotate a dataset named Comment Sentiment toward Micro Video (CSMV) to support this research. To our knowledge, it is the largest video multimodal sentiment dataset in terms of scale and video duration, containing 107,267 comments and 8,210 micro videos with a total duration of 68.83 hours. Since inferring the induced sentiment of a comment should leverage the video content, we propose the Video Content-aware Comment Sentiment Analysis (VC-CSA) method as a baseline to address the challenges inherent in this new task. Extensive experiments demonstrate that our method shows significant improvements over other established baselines. We make the dataset and source code publicly available at https://github.com/IEIT-AGI/MSA-CRVI.

JBHI 2022 Journal Article

Sparse-Based Domain Adaptation Network for OCTA Image Super-Resolution Reconstruction

  • Huaying Hao
  • Cong Xu
  • Dan Zhang
  • Qifeng Yan
  • Jiong Zhang
  • Yue Liu
  • Yitian Zhao

Retinal Optical Coherence Tomography Angiography (OCTA) with high resolution is important for the quantification and analysis of retinal vasculature. However, the resolution of OCTA images is inversely proportional to the field of view at the same sampling frequency, which hinders clinicians from analyzing larger vascular areas. In this paper, we propose a novel Sparse-based domain Adaptation Super-Resolution network (SASR) for the reconstruction of realistic $6\times6\,\mathrm{mm}^2$ low-resolution (LR) OCTA images into high-resolution (HR) representations. More specifically, we first perform a simple degradation of the $3\times3\,\mathrm{mm}^2$ HR image to obtain a synthetic LR image. An efficient registration method is then employed to register the synthetic LR image with its corresponding $3\times3\,\mathrm{mm}^2$ region within the $6\times6\,\mathrm{mm}^2$ image to obtain the cropped realistic LR image. We then propose a multi-level super-resolution model for the fully supervised reconstruction of the synthetic data, guiding the reconstruction of the realistic LR images through a generative-adversarial strategy that allows the synthetic and realistic LR images to be unified in the feature domain. Finally, a novel sparse edge-aware loss is designed to dynamically optimize the vessel edge structure. Extensive experiments on two OCTA datasets show that our method outperforms state-of-the-art super-resolution reconstruction methods. In addition, we investigate the performance of the reconstruction results on retinal structure segmentation, which further validates the effectiveness of our approach.
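
As a rough illustration of an edge-aware penalty in the spirit of the abstract's final ingredient, the sketch below weights a pixel-wise L1 loss by a Sobel edge map of the HR target so vessel boundaries dominate the objective. The paper's sparse edge-aware loss is defined differently; this only conveys the general idea.

```python
import torch
import torch.nn.functional as F

def edge_weighted_l1(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """pred, target: (batch, 1, H, W) grayscale OCTA images in [0, 1]."""
    # Sobel kernels for horizontal and vertical intensity gradients.
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                      device=target.device).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    gx = F.conv2d(target, kx, padding=1)
    gy = F.conv2d(target, ky, padding=1)
    edges = (gx.pow(2) + gy.pow(2)).sqrt()            # edge strength of HR target
    w = 1.0 + edges / edges.amax().clamp(min=1e-8)    # emphasize vessel edges
    return (w * (pred - target).abs()).mean()
```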

NeurIPS 2017 Conference Paper

TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning

  • Wei Wen
  • Cong Xu
  • Feng Yan
  • Chunpeng Wu
  • Yandan Wang
  • Yiran Chen
  • Hai Li

High network communication cost for synchronizing gradients and parameters is the well-known bottleneck of distributed training. In this work, we propose TernGrad, which uses ternary gradients to accelerate distributed deep learning in data parallelism. Our approach requires only three numerical levels {-1, 0, 1}, which can aggressively reduce communication time. We mathematically prove the convergence of TernGrad under the assumption of a bound on gradients. Guided by the bound, we propose layer-wise ternarizing and gradient clipping to improve its convergence. Our experiments show that applying TernGrad to AlexNet does not incur any accuracy loss and can even improve accuracy, while the accuracy loss of GoogLeNet induced by TernGrad is less than 2% on average. Finally, a performance model is proposed to study the scalability of TernGrad. Experiments show significant speed gains for various deep neural networks. Our source code is available.
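
A minimal sketch of layer-wise stochastic ternarization with gradient clipping, in the spirit of the abstract; the clipping threshold and function names here are illustrative, not the authors' reference implementation.

```python
import torch

def ternarize(grad: torch.Tensor, clip_sigma: float = 2.5) -> torch.Tensor:
    """Unbiased stochastic ternarization of one layer's gradient."""
    # Layer-wise gradient clipping bounds the scaler s, which the paper's
    # analysis ties to the convergence bound.
    std = grad.std().item()
    g = grad.clamp(-clip_sigma * std, clip_sigma * std)
    s = g.abs().max()                     # layer-wise scaler s = max|g|
    if s == 0:
        return torch.zeros_like(g)
    p = g.abs() / s                       # keep probability, in [0, 1]
    b = torch.bernoulli(p)                # stochastic mask with E[b] = |g|/s
    return s * torch.sign(g) * b          # values restricted to {-s, 0, +s}
```

In expectation the result equals the clipped gradient, since E[b] = |g|/s cancels the scaler, which is why this style of quantization can stay unbiased while each transmitted value needs only three levels plus one scaler per layer.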