Arrow Research search

Author name cluster

Ruibo Fu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

5 papers
1 author row

Possible papers (5)

AAAI Conference 2026 · Conference Paper

Detect All-Type Deepfake Audio: Wavelet Prompt Tuning for Enhanced Auditory Perception

  • Yuankun Xie
  • Ruibo Fu
  • Xiaopeng Wang
  • Zhiyong Wang
  • Songjun Cao
  • Long Ma
  • Haonan Cheng
  • Long Ye

The rapid advancement of audio generation technologies has escalated the risks of malicious deepfake audio across speech, sound, singing voice, and music, threatening multimedia security and trust. While existing countermeasures (CMs) perform well in single-type audio deepfake detection (ADD), their performance declines in cross-type scenarios. This paper is dedicated to studying the all-type ADD task. We are the first to comprehensively establish an all-type ADD benchmark to evaluate current CMs, incorporating cross-type deepfake detection across speech, sound, singing voice, and music. We then introduce the prompt tuning self-supervised learning (PT-SSL) training paradigm, which optimizes the SSL front-end by learning specialized prompt tokens for ADD, requiring 458× fewer trainable parameters than fine-tuning (FT). Considering the auditory perception of different audio types, we propose the wavelet prompt tuning (WPT)-SSL method to capture type-invariant auditory deepfake information from the frequency domain without requiring additional training parameters, thereby surpassing FT on the all-type ADD task. To achieve a universal CM, we co-train on all types of deepfake audio. Experimental results demonstrate that WPT-XLSR-AASIST achieves the best performance, with an average EER of 3.58% across all evaluation sets.
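
The prompt-tuning idea in this abstract can be pictured with a minimal sketch: freeze the pretrained SSL encoder and train only a small set of prompt tokens prepended to its input, so the trainable parameter count stays tiny compared with fine-tuning. The module below is a hypothetical PyTorch illustration under assumed names and shapes; it is not the authors' PT-SSL or WPT implementation.

```python
# Hypothetical sketch of prompt tuning a frozen SSL front-end for deepfake
# detection; module names, shapes, and the prompt count are assumptions.
import torch
import torch.nn as nn

class PromptTunedFrontEnd(nn.Module):
    def __init__(self, ssl_encoder: nn.Module, dim: int = 1024, n_prompts: int = 16):
        super().__init__()
        self.ssl_encoder = ssl_encoder            # e.g. a pretrained XLS-R-style encoder
        for p in self.ssl_encoder.parameters():   # freeze the SSL backbone
            p.requires_grad = False
        # only these prompt tokens are trained, far fewer parameters than fine-tuning
        self.prompts = nn.Parameter(torch.randn(n_prompts, dim) * 0.02)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, time, dim) frame-level features fed to the encoder
        b = frame_feats.size(0)
        prompts = self.prompts.unsqueeze(0).expand(b, -1, -1)
        x = torch.cat([prompts, frame_feats], dim=1)  # prepend learnable prompts
        return self.ssl_encoder(x)                    # frozen transformer layers
```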

AAAI Conference 2026 · Conference Paper

PSA-MF: Personality-Sentiment Aligned Multi-Level Fusion for Multimodal Sentiment Analysis

  • Heng Xie
  • Kang Zhu
  • Zhengqi Wen
  • Jianhua Tao
  • Xuefei Liu
  • Ruibo Fu
  • Changsheng Li

Multimodal sentiment analysis (MSA) recognizes human sentiments by combining textual, visual, and audio modalities. The main challenge lies in integrating sentiment-related information from the different modalities, which arises chiefly in the unimodal feature extraction phase and the multimodal feature fusion phase. Existing methods extract only shallow information from unimodal features during the extraction phase, neglecting sentiment differences across personalities. During the fusion phase, they directly merge the feature information from each modality without considering differences at the feature level, which ultimately degrades recognition performance. To address this problem, we propose a personality-sentiment aligned multi-level fusion framework. We introduce personality traits during the feature extraction phase and propose a novel personality-sentiment alignment method that, for the first time, obtains personalized sentiment embeddings from the textual modality. In the fusion phase, we introduce a novel multi-level fusion method that gradually integrates sentiment information from the textual, visual, and audio modalities through multimodal pre-fusion and a multi-level enhanced fusion strategy. Evaluated on two commonly used datasets, our method achieves state-of-the-art results.
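
The gradual, level-by-level fusion described here can be sketched generically as gated fusion applied in stages, where each step decides how much of another modality to admit into the running representation. The snippet below is an illustrative assumption about one way such staged fusion could look; the layer sizes, gating scheme, and fusion order are not taken from PSA-MF.

```python
# Hypothetical gated multi-level fusion over text, audio, and visual features;
# dimensions, gating, and ordering are illustrative assumptions only.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, anchor: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
        # anchor, other: (batch, dim) unimodal (or already pre-fused) representations
        joint = torch.cat([anchor, other], dim=-1)
        g = torch.sigmoid(self.gate(joint))     # how much of `other` to admit
        return anchor + g * torch.tanh(self.proj(joint))

# One possible ordering: pre-fuse text with audio, then fuse the result with visual.
fuse = GatedFusion()
text, audio, visual = (torch.randn(4, 256) for _ in range(3))
fused = fuse(fuse(text, audio), visual)
```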

AAAI Conference 2026 · Conference Paper

Trainable EEG Interpolation and Structure-Sharing Dual-Path Encoders for Brain-Assisted Target Speaker Extraction

  • Zhao Lv
  • Haoran Zhou
  • Ying Chen
  • Youdian Gao
  • Xinhui Li
  • Ruibo Fu
  • Cunhang Fan

Brain-assisted target speaker extraction (TSE) isolates a target speaker's voice from a mixture by leveraging task-specific representations in electroencephalogram (EEG) signals. However, existing methods rely on fixed interpolation for EEG-audio alignment, introducing redundant computation, and employ single-path encoders that extract only target-relevant features while neglecting complementary, irrelevant ones, limiting discriminability. To address these limitations, this paper proposes a Trainable EEG Interpolation and Structure-sharing Dual-path Encoders network (TIDENet). The proposed Trainable EEG Interpolation (TEI) uses a neural network module whose parameters are updated during training, letting it exploit cross-sample EEG information during resampling and thereby overcoming the limitations of fixed interpolation. The Structure-sharing Dual-path Encoders (SSDPE) extend existing speech and EEG encoders with dual paths that separately process features relevant and irrelevant to the target speaker and incorporate interactive fusion between them, enhancing the encoders' ability to capture task-relevant information. Experimental results on public datasets demonstrate that TIDENet achieves relative improvements of up to 20.47%, 22.22%, 2.91%, 6.20%, and 15.84% in signal-to-distortion ratio (SDR), scale-invariant SDR (SI-SDR), short-time objective intelligibility (STOI), extended STOI (ESTOI), and perceptual evaluation of speech quality (PESQ), respectively, over the state of the art. These gains validate the effectiveness of the proposed TEI method and SSDPE architecture.
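
The contrast between fixed and trainable interpolation can be made concrete with a small sketch: instead of resampling EEG to the audio frame rate with a fixed linear or spline kernel, a learned upsampling layer lets the resampling weights adapt during training. The module below is a hypothetical illustration; the choice of a 1-D transposed convolution and all sizes are assumptions, not the TEI design.

```python
# Hypothetical sketch of trainable EEG resampling (learned rather than fixed
# interpolation); the transposed-convolution formulation is an assumption.
import torch
import torch.nn as nn

class TrainableEEGInterpolation(nn.Module):
    def __init__(self, channels: int = 64, upsample: int = 4):
        super().__init__()
        # learned upsampling: kernel weights are updated by backprop,
        # unlike a fixed linear/spline interpolation kernel
        self.up = nn.ConvTranspose1d(channels, channels,
                                     kernel_size=2 * upsample,
                                     stride=upsample,
                                     padding=upsample // 2)

    def forward(self, eeg: torch.Tensor, target_len: int) -> torch.Tensor:
        # eeg: (batch, channels, eeg_time) -> (batch, channels, target_len)
        x = self.up(eeg)
        return x[..., :target_len]  # trim to the audio frame length
```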

AAAI Conference 2025 · Conference Paper

Code-switching Mediated Sentence-level Semantic Learning

  • Shuai Zhang
  • Jiangyan Yi
  • Zhengqi Wen
  • Jianhua Tao
  • Feihu Che
  • Jinyang Wu
  • Ruibo Fu

Code-switching is a linguistic phenomenon in which different languages are used interactively during conversation. It poses significant performance challenges to natural language processing (NLP) tasks because the underlying systems are often monolingual. We focus on sentence-level semantic associations between different code-switching expressions and propose an innovative task-free semantic learning method based on this property. Specifically, a sentence with the same meaning can be code-switched in many different ways. We refine this observation into a computational method by designing a semantic-invariance constraint loss applied during model optimization. In this work, we conduct thorough experiments on speech recognition, speech translation, and language modeling tasks. The experimental results demonstrate that the proposed method consistently improves performance on code-switching-related tasks.
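
The semantic-invariance constraint described here amounts to pulling together the representations of different code-switched renderings of the same sentence. A minimal sketch of such an auxiliary loss is shown below; the cosine-distance formulation and function names are assumptions for illustration, not the paper's exact loss.

```python
# Hypothetical semantic-invariance loss: embeddings of two code-switched
# variants of the same sentence are encouraged to agree (cosine distance).
import torch
import torch.nn.functional as F

def semantic_invariance_loss(emb_a: torch.Tensor, emb_b: torch.Tensor) -> torch.Tensor:
    """emb_a, emb_b: (batch, dim) sentence embeddings of two code-switched
    variants with the same meaning; returns the mean cosine distance."""
    return (1.0 - F.cosine_similarity(emb_a, emb_b, dim=-1)).mean()

# Such a term would be added to the main task loss (ASR, speech translation,
# or language modeling) during optimization.
```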