Arrow Research

Author name cluster

Jiangyan Yi

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

7 papers
2 author rows

Possible papers (7)

ICML 2025 · Conference Paper

AffectGPT: A New Dataset, Model, and Benchmark for Emotion Understanding with Multimodal Large Language Models

  • Zheng Lian 0004
  • Haoyu Chen 0001
  • Lan Chen 0005
  • Haiyang Sun 0004
  • Licai Sun
  • Yong Ren
  • Zebang Cheng
  • Bin Liu 0041

The emergence of multimodal large language models (MLLMs) advances multimodal emotion recognition (MER) to the next level: from naive discriminative tasks to complex emotion understanding with advanced video understanding abilities and natural language description. However, the current community suffers from a lack of large-scale datasets with intensive, descriptive emotion annotations, as well as a multimodal-centric framework to maximize the potential of MLLMs for emotion understanding. To address this, we establish a new benchmark for MLLM-based emotion understanding with a novel dataset (MER-Caption) and a new model (AffectGPT). Using our model-based crowd-sourcing data collection strategy, we construct the largest descriptive emotion dataset to date, featuring over 2K fine-grained emotion categories across 115K samples. We also introduce the AffectGPT model, designed with pre-fusion operations to enhance multimodal integration. Finally, we present MER-UniBench, a unified benchmark with evaluation metrics tailored to typical MER tasks and the free-form, natural language output style of MLLMs. Extensive experimental results show AffectGPT's robust performance across various MER tasks. We have released both the code and the dataset to advance research and development in emotion understanding: https://github.com/zeroQiaoba/AffectGPT.
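The abstract does not spell out the pre-fusion design, so the following is only a minimal sketch of one plausible reading: per-modality token sequences are merged by a small attention block before reaching the language model. All module names, dimensions, and the attention layout here are hypothetical, not taken from AffectGPT.

```python
import torch
import torch.nn as nn

class PreFusion(nn.Module):
    """Hypothetical pre-fusion block: merge audio/video/text token
    sequences with self-attention before they reach the LLM."""
    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio, video, text):
        # Concatenate per-modality token sequences: (B, La+Lv+Lt, dim)
        tokens = torch.cat([audio, video, text], dim=1)
        fused, _ = self.attn(tokens, tokens, tokens)
        return self.norm(tokens + fused)  # residual connection + norm

# Toy usage: batch of 2, 768-dim features per modality
a, v, t = (torch.randn(2, n, 768) for n in (10, 16, 20))
fused = PreFusion()(a, v, t)  # -> (2, 46, 768), then fed to the LLM
```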

AAAI 2025 · Conference Paper

Code-switching Mediated Sentence-level Semantic Learning

  • Shuai Zhang
  • Jiangyan Yi
  • Zhengqi Wen
  • Jianhua Tao
  • Feihu Che
  • Jinyang Wu
  • Ruibo Fu

Code-switching is a linguistic phenomenon in which different languages are used interactively during conversation. It poses significant performance challenges to natural language processing (NLP) tasks because the underlying systems are often monolingual. We focus on sentence-level semantic associations between different code-switching expressions, and we propose an innovative task-free semantic learning method based on this semantic property. Specifically, a sentence with the same meaning can be rendered with many different language-switching patterns. We refine this observation into a semantic computation method by designing a semantic invariance constraint loss applied during model optimization. We conduct thorough experiments on speech recognition, speech translation, and language modeling tasks. The experimental results demonstrate that the proposed method broadly improves the performance of code-switching related tasks.
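As a rough, hypothetical illustration of the semantic-invariance idea (not the paper's actual loss), one could pull the sentence embeddings of code-switched variants that share one meaning toward a common point:

```python
import torch
import torch.nn.functional as F

def semantic_invariance_loss(variant_embs: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch: penalize spread among embeddings of
    code-switched variants of the same sentence.

    variant_embs: (num_variants, dim) embeddings of one sentence
    rendered with different language-switching patterns.
    """
    center = variant_embs.mean(dim=0, keepdim=True)
    return F.mse_loss(variant_embs, center.expand_as(variant_embs))

# Toy usage: three code-switched renderings of one sentence,
# added as an auxiliary term to the main task loss
embs = torch.randn(3, 512)
loss = semantic_invariance_loss(embs)
```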

ICML 2025 · Conference Paper

OV-MER: Towards Open-Vocabulary Multimodal Emotion Recognition

  • Zheng Lian 0004
  • Haiyang Sun 0004
  • Licai Sun
  • Haoyu Chen 0001
  • Lan Chen 0005
  • Hao Gu
  • Zhuofan Wen 0001
  • Shun Chen

Multimodal Emotion Recognition (MER) is a critical research area that seeks to decode human emotions from diverse data modalities. However, existing machine learning methods predominantly rely on predefined emotion taxonomies, which fail to capture the inherent complexity, subtlety, and multi-appraisal nature of human emotional experiences, as demonstrated by studies in psychology and cognitive science. To overcome this limitation, we advocate for introducing the concept of open vocabulary into MER. This paradigm shift aims to enable models to predict emotions beyond a fixed label space, accommodating a flexible set of categories to better reflect the nuanced spectrum of human emotions. To achieve this, we propose a novel paradigm: Open-Vocabulary MER (OV-MER), which enables emotion prediction without being confined to predefined spaces. However, constructing a dataset that encompasses the full range of emotions for OV-MER is practically infeasible; hence, we present a comprehensive solution including a newly curated database, novel evaluation metrics, and a preliminary benchmark. By advancing MER from basic emotions to more nuanced and diverse emotional states, we hope this work can inspire the next generation of MER, enhancing its generalizability and applicability in real-world scenarios. Code and dataset are available at: https://github.com/zeroQiaoba/AffectGPT.
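The paper's evaluation metrics are not given here; one hedged sketch of how free-form predictions might be scored against gold labels is a soft set-matching based on embedding similarity. The embeddings below are random stand-ins, and OV-MER's real metrics may differ substantially.

```python
import torch
import torch.nn.functional as F

def ov_match_score(pred_embs: torch.Tensor, gold_embs: torch.Tensor) -> float:
    """Hypothetical open-vocabulary scoring: each predicted emotion label
    (as an embedding) is credited with its best cosine match among the
    gold labels, and vice versa; the two averages combine into an F-score.
    """
    sim = F.cosine_similarity(pred_embs.unsqueeze(1),
                              gold_embs.unsqueeze(0), dim=-1)
    precision = sim.max(dim=1).values.mean()  # best gold per prediction
    recall = sim.max(dim=0).values.mean()     # best prediction per gold
    return (2 * precision * recall / (precision + recall)).item()

# Toy usage with random stand-ins for label embeddings
preds, golds = torch.randn(3, 384), torch.randn(2, 384)
print(ov_match_score(preds, golds))
```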

AAAI 2025 · Conference Paper

Region-Based Optimization in Continual Learning for Audio Deepfake Detection

  • Yujie Chen
  • Jiangyan Yi
  • Cunhang Fan
  • Jianhua Tao
  • Yong Ren
  • Siding Zeng
  • Chu Yuan Zhang
  • Xinrui Yan

Rapid advancements in speech synthesis and voice conversion bring convenience but also new security risks, creating an urgent need for effective audio deepfake detection. Although current models perform well, their effectiveness diminishes when confronted with the diverse and evolving nature of real-world deepfakes. To address this issue, we propose a continual learning method named Region-Based Optimization (RegO) for audio deepfake detection. Specifically, we use the Fisher information matrix to measure which neuron regions are important for real and fake audio detection, dividing the network into four regions. First, we directly fine-tune the less important regions to quickly adapt to new tasks. Next, we apply gradient optimization in parallel directions for regions important only to real audio detection, and in orthogonal directions for regions important only to fake audio detection. For regions important to both, we use sample-proportion-based adaptive gradient optimization. This region-adaptive optimization ensures an appropriate trade-off between memory stability and learning plasticity. Additionally, to address the accumulation of redundant neurons from old tasks, we introduce an Ebbinghaus forgetting mechanism to release them, thereby promoting the model's ability to learn more generalized discriminative features. Experimental results show our method achieves a 21.3% improvement in EER over RWM, the state-of-the-art continual learning approach for audio deepfake detection. Moreover, the effectiveness of RegO extends beyond audio deepfake detection, showing potential significance in other tasks such as image recognition.
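A hedged sketch of the region partition described above, assuming a diagonal Fisher estimate per parameter and a hypothetical importance threshold `tau` (the paper's actual partition rule may differ):

```python
import torch

def region_masks(fisher_real: torch.Tensor,
                 fisher_fake: torch.Tensor,
                 tau: float = 1e-3) -> dict:
    """Hypothetical sketch of a RegO-style partition: per-parameter
    Fisher estimates mark importance to real and/or fake audio
    detection, yielding four disjoint regions."""
    imp_r, imp_f = fisher_real > tau, fisher_fake > tau
    return {
        "neither": ~imp_r & ~imp_f,   # fine-tune freely
        "real_only": imp_r & ~imp_f,  # parallel-direction update
        "fake_only": ~imp_r & imp_f,  # orthogonal-direction update
        "both": imp_r & imp_f,        # proportion-adaptive update
    }

# Toy usage on one parameter tensor
f_real, f_fake = torch.rand(4, 4) * 1e-2, torch.rand(4, 4) * 1e-2
masks = region_masks(f_real, f_fake)
print({k: int(v.sum()) for k, v in masks.items()})
```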

AIJ 2024 · Journal Article

Emotion selectable end-to-end text-based speech editing

  • Tao Wang
  • Jiangyan Yi
  • Ruibo Fu
  • Jianhua Tao
  • Zhengqi Wen
  • Chu Yuan Zhang

Text-based speech editing is a convenient way for users to edit speech by intuitively cutting, copying, and pasting text. Previous work introduced CampNet, a context-aware mask prediction network that significantly improved the quality of edited speech. Building on this, this paper proposes a new task: adding emotional effects to edited speech during text-based speech editing, to enhance its expressiveness and controllability. To achieve this, we introduce Emo-CampNet, which allows users to select emotional attributes for the generated speech and can edit the speech of unseen speakers. First, the proposed end-to-end model controls the emotion of the generated speech by introducing additional emotion attributes into the context-aware mask prediction network. Second, to prevent emotional interference from the original speech, a neutral content generator is proposed to remove the emotional components; it is optimized within a generative adversarial framework. Third, two data augmentation methods are proposed to enrich the emotional and pronunciation information in the training set. Experimental results show that Emo-CampNet effectively controls the emotion of the generated speech and can edit the speech of unseen speakers. Ablation experiments further validate the effectiveness of the emotional selectivity and data augmentation methods.
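As a hypothetical sketch of how a selectable emotion attribute could condition such a model (the actual Emo-CampNet conditioning is not detailed here), one might add a learned emotion embedding to the encoder features:

```python
import torch
import torch.nn as nn

class EmotionConditioner(nn.Module):
    """Hypothetical sketch: inject a selectable emotion attribute into a
    text-based speech-editing model by adding a learned emotion
    embedding to every frame of the encoder features."""
    def __init__(self, num_emotions: int = 5, dim: int = 256):
        super().__init__()
        self.emotion_table = nn.Embedding(num_emotions, dim)

    def forward(self, frames: torch.Tensor, emotion_id: torch.Tensor):
        # frames: (B, T, dim); emotion_id: (B,) target emotion indices
        emo = self.emotion_table(emotion_id).unsqueeze(1)  # (B, 1, dim)
        return frames + emo  # broadcast over the time axis

# Toy usage: condition 100 encoder frames on emotion class 3
x = torch.randn(2, 100, 256)
y = EmotionConditioner()(x, torch.tensor([3, 3]))
```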

AAAI 2024 · Conference Paper

What to Remember: Self-Adaptive Continual Learning for Audio Deepfake Detection

  • Xiaohui Zhang
  • Jiangyan Yi
  • Chenglong Wang
  • Chu Yuan Zhang
  • Siding Zeng
  • Jianhua Tao

The rapid evolution of speech synthesis and voice conversion has raised substantial concerns due to the potential misuse of such technology, prompting a pressing need for effective audio deepfake detection mechanisms. Existing detection models have shown remarkable success in discriminating known deepfake audio, but struggle when encountering new attack types. To address this challenge, one of the emergent effective approaches is continual learning. In this paper, we propose a continual learning approach called Radian Weight Modification (RWM) for audio deepfake detection. The fundamental concept underlying RWM involves categorizing all classes into two groups: those with compact feature distributions across tasks, such as genuine audio, and those with more spread-out distributions, like various types of fake audio. These distinctions are quantified by means of the in-class cosine distance, which subsequently serves as the basis for RWM to introduce a trainable gradient modification direction for distinct data types. Experimental evaluations against mainstream continual learning methods reveal the superiority of RWM in terms of knowledge acquisition and mitigating forgetting in audio deepfake detection. Furthermore, RWM's applicability extends beyond audio deepfake detection, demonstrating its potential significance in diverse machine learning domains such as image recognition.
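A small sketch of the compactness measure the abstract describes, reading "in-class cosine distance" as the mean cosine distance of class features to their centroid (an assumption, since the exact definition is not given here):

```python
import torch
import torch.nn.functional as F

def in_class_cosine_distance(feats: torch.Tensor) -> torch.Tensor:
    """Hedged sketch of the compactness measure: mean cosine distance
    of each sample's feature to the class centroid. Compact classes
    (e.g. genuine audio) give small values; spread-out classes
    (varied fake audio) give larger ones."""
    center = F.normalize(feats.mean(dim=0, keepdim=True), dim=-1)
    cos = F.cosine_similarity(F.normalize(feats, dim=-1), center, dim=-1)
    return (1.0 - cos).mean()

# Toy usage: 32 feature vectors from one class
d = in_class_cosine_distance(torch.randn(32, 128))
print(float(d))
```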

ICML 2023 · Conference Paper

Do You Remember? Overcoming Catastrophic Forgetting for Fake Audio Detection

  • Xiaohui Zhang 0006
  • Jiangyan Yi
  • Jianhua Tao 0001
  • Chenglong Wang 0001
  • Chu Yuan Zhang

Current fake audio detection algorithms achieve promising performance on most datasets. However, their performance may degrade significantly when dealing with audio from a different dataset. Orthogonal weight modification, a common remedy for catastrophic forgetting, does not consider the similarity of genuine audio across different datasets. To overcome this limitation, we propose a continual learning algorithm for fake audio detection called Regularized Adaptive Weight Modification (RAWM). When fine-tuning a detection network, our approach adaptively computes the direction of weight modification according to the ratio of genuine to fake utterances. This adaptive modification direction ensures the network can effectively detect fake audio on the new dataset while preserving the knowledge learned from previous datasets, thus mitigating catastrophic forgetting. In addition, genuine audio collected under quite different acoustic conditions may have a skewed feature distribution, so we introduce a regularization constraint that forces the network to remember the old distribution in this regard. Our method can easily be generalized to related fields, such as speech emotion recognition. We also evaluate our approach across multiple datasets and obtain a significant performance improvement in cross-dataset experiments.
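As a hedged sketch of the adaptive direction idea (the paper's exact update rule is not reproduced here), one could mix a gradient component parallel to an old-task reference direction with its orthogonal remainder, weighted by the batch's genuine-to-fake ratio:

```python
import torch
import torch.nn.functional as F

def adaptive_direction(grad: torch.Tensor,
                       old_dir: torch.Tensor,
                       genuine_ratio: float) -> torch.Tensor:
    """Hypothetical sketch: decompose the gradient relative to a
    unit-norm old-task reference direction. More genuine audio in a
    batch favors the parallel component (genuine distributions are
    shared across datasets); more fake audio favors the orthogonal
    remainder."""
    parallel = (grad * old_dir).sum() * old_dir  # projection onto old_dir
    orthogonal = grad - parallel
    return genuine_ratio * parallel + (1.0 - genuine_ratio) * orthogonal

# Toy usage: flattened gradient and a random unit reference direction
g = torch.randn(10)
d = F.normalize(torch.randn(10), dim=0)
new_g = adaptive_direction(g, d, genuine_ratio=0.7)
```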