Arrow Research

Author name cluster

Shangfei Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

12 papers

Possible papers (12)

AAAI Conference 2026 Conference Paper

ConsistTalk: Intensity Controllable Temporally Consistent Talking Head Generation with Diffusion Noise Search

  • Zhenjie Liu
  • Jianzhang Lu
  • Renjie Lu
  • Cong Liang
  • Shangfei Wang

Recent advancements in video diffusion models have significantly enhanced audio-driven portrait animation. However, current methods still suffer from flickering, identity drift, and poor audio-visual synchronization. These issues primarily stem from entangled appearance-motion representations and unstable inference strategies. In this paper, we introduce ConsistTalk, a novel intensity-controllable and temporally consistent talking head generation framework with diffusion noise search inference. First, we propose an optical flow-guided temporal module (OFT) that decouples motion features from static appearance by leveraging facial optical flow, thereby reducing visual flicker and improving temporal consistency. Second, we present an Audio-to-Intensity (A2I) model obtained through multimodal teacher-student knowledge distillation. By transforming audio and facial velocity features into a frame-wise intensity sequence, the A2I model enables joint modeling of audio and visual motion, resulting in more natural dynamics. This further enables fine-grained, frame-wise control of motion dynamics while maintaining tight audio-visual synchronization. Third, we introduce a diffusion noise initialization strategy (IC-Init). By enforcing explicit constraints on background coherence and motion continuity during inference-time noise search, we achieve better identity preservation and more refined motion dynamics than current autoregressive strategies. Extensive experiments demonstrate that ConsistTalk significantly outperforms prior methods in reducing flicker, preserving identity, and delivering temporally stable, high-fidelity talking head videos.
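
The abstract does not include code; below is a minimal sketch of what an inference-time noise search in the spirit of IC-Init could look like, assuming a candidate-scoring loop over random initial noises. The function names, tensor shapes, and the two penalty terms are illustrative assumptions, not the paper's implementation.

```python
import torch

def ic_init_search(denoise_fn, cond, n_candidates=8, shape=(1, 4, 16, 64, 64)):
    """Score candidate initial noises and keep the best one.
    Tensors are (batch, channels, frames, height, width)."""
    best_noise, best_score = None, float("inf")
    for _ in range(n_candidates):
        noise = torch.randn(shape)
        preview = denoise_fn(noise, cond)  # cheap partial denoising preview
        # Background coherence: later frames should stay close to the first
        # frame (a crude proxy; the paper would mask out the face region).
        bg = (preview - preview[:, :, :1]).abs().mean()
        # Motion continuity: penalize acceleration spikes between frames.
        vel = preview[:, :, 1:] - preview[:, :, :-1]
        mc = (vel[:, :, 1:] - vel[:, :, :-1]).pow(2).mean()
        score = (bg + mc).item()
        if score < best_score:
            best_noise, best_score = noise, score
    return best_noise

# e.g. ic_init_search(lambda z, c: z, cond=None) with a real partial-denoise pass
```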

AAAI Conference 2026 Conference Paper

FINE: Factorized Multimodal Sentiment Analysis via Mutual INformation Estimation

  • Yadong Liu
  • Shangfei Wang

Multimodal sentiment analysis remains a challenging task due to the inherent heterogeneity across modalities. Such heterogeneity often manifests as asynchronous signals, imbalanced information between modalities, and interference from task-irrelevant noise, hindering the learning of robust and accurate sentiment representations. To address these issues, we propose a factorized multimodal fusion framework that first disentangles each modality into shared and unique representations, and then suppresses task-irrelevant noise within both to retain only sentiment-critical representations. This fine-grained decomposition improves representation quality by reducing redundancy, promoting cross-modal complementarity, and isolating task-relevant sentiment cues. Rather than manipulating the feature space directly, we adopt a mutual information–based optimization strategy to guide the factorization process in a more stable and principled manner. To further support feature extraction and long-term temporal modeling, we introduce two auxiliary modules: a Mixture of Q-Formers, placed before factorization, which uses learnable queries to extract fine-grained affective features from multiple modalities, and a Dynamic Contrastive Queue, placed after factorization, which stores the latest high-level representations for contrastive learning, enabling the model to capture long-range discriminative patterns and improve class-level separability. Extensive experiments on multiple public datasets demonstrate that our method consistently outperforms existing approaches, validating the effectiveness and robustness of the proposed framework.
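
For intuition, here is a minimal sketch of a mutual-information-style factorization objective, assuming two modalities already encoded into shared and unique vectors. The InfoNCE bound is a standard estimator; the paper may use a different one (e.g. an upper bound such as CLUB for the separation term, for which a cosine penalty is a crude stand-in here). All names are hypothetical.

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.1):
    """InfoNCE lower bound on MI between paired representations a, b: (N, D)."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                  # (N, N) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)           # minimizing maximizes MI(a; b)

def factorization_loss(shared_a, shared_b, unique_a, unique_b):
    """Pull the two modalities' shared parts together, push each modality's
    shared and unique parts apart. The paper would use a proper MI upper bound
    (e.g. CLUB) for the second term; a cosine penalty is a crude stand-in."""
    align = info_nce(shared_a, shared_b)
    sep = F.cosine_similarity(shared_a, unique_a).abs().mean() \
        + F.cosine_similarity(shared_b, unique_b).abs().mean()
    return align + sep
```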

AAAI Conference 2026 Conference Paper

Learning Knowledge from Textual Descriptions for 3D Human Pose Estimation

  • Yi Wu
  • Jingtian Li
  • Shangfei Wang
  • Guoming Li
  • Meng Mao
  • Linxiang Tan

Mainstream 3D human pose estimation methods directly predict 3D coordinates of joints from 2D keypoints, suffering from severe depth ambiguity. Pose textual descriptions contain abundant semantic information, which helps the model learn the spatial relationships among different body parts, partially alleviating this issue. Leveraging this insight, we propose a 3D human pose estimation method assisted by textual descriptions. Specifically, we utilize an automatic captioning pipeline to generate textual descriptions of 3D poses based on spatial relations among joints. These descriptions include details regarding angles, distances, relative positions, pitch and roll, and ground contacts. Subsequently, text features are extracted from these descriptions using a language model, while a 3D human pose estimation model extracts pose features. Aligning the pose features with the text features allows for a more targeted optimization of the estimation model. Therefore, we systematically introduce three alignment approaches to effectively align features extracted by two models operating in entirely different domains. Our method incorporates prior knowledge derived from the textual descriptions into the estimation model and can be seamlessly applied to various existing frameworks. Experimental results on the Human3.6M and MPI-INF-3DHP datasets demonstrate that our method surpasses state-of-the-art methods.
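
As a toy illustration of what a rule-based captioning pipeline over joint relations might look like; the paper's actual rules are not given here, so the joint names, rules, and thresholds below are all assumptions.

```python
import numpy as np

def caption_pose(joints):
    """Rule-based caption from spatial relations among joints (y-up assumed).
    joints: dict mapping joint name -> xyz array. Illustrative only."""
    parts = []
    # Relative position: which wrist is higher.
    higher = "left" if joints["left_wrist"][1] > joints["right_wrist"][1] else "right"
    parts.append(f"the {higher} wrist is the higher one")
    # Distance: are the hands close together?
    d = np.linalg.norm(joints["left_wrist"] - joints["right_wrist"])
    parts.append("the hands are close together" if d < 0.3 else "the hands are apart")
    # Ground contact: a foot near the lowest point counts as touching the ground.
    floor = min(joints["left_foot"][1], joints["right_foot"][1])
    for side in ("left", "right"):
        if joints[f"{side}_foot"][1] - floor < 0.05:
            parts.append(f"the {side} foot touches the ground")
    return "; ".join(parts) + "."
```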

IJCAI Conference 2024 Conference Paper

1DFormer: A Transformer Architecture Learning 1D Landmark Representations for Facial Landmark Tracking

  • Shi Yin
  • Shijie Huang
  • Shangfei Wang
  • Jinshui Hu
  • Tao Guo
  • Bing Yin
  • Baocai Yin
  • Cong Liu

Recently, heatmap regression methods based on 1D landmark representations have shown prominent performance on locating facial landmarks. However, previous methods have not deeply explored the potential of 1D landmark representations for sequential and structural modeling of multiple landmarks in facial landmark tracking. To address this limitation, we propose a Transformer architecture, namely 1DFormer, which learns informative 1D landmark representations by capturing the dynamic and the geometric patterns of landmarks via token communications in both temporal and spatial dimensions for facial landmark tracking. For temporal modeling, we propose a confidence-enhanced multi-head attention mechanism with a recurrent token-mixing strategy to adaptively and robustly embed long-term landmark dynamics into their 1D representations; for structure modeling, we design intra-group and inter-group geometric encoding mechanisms to encode the component-level as well as global-level facial structure patterns as a refinement for the 1D representations of landmarks through token communications in the spatial dimension via 1D convolutional layers. Experimental results on the 300VW and the TF databases show that 1DFormer successfully models the long-range sequential patterns as well as the inherent facial structures to learn informative 1D representations of landmark sequences, and achieves state-of-the-art performance on facial landmark tracking. Code for our model is available in the supplementary materials.
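
One plausible reading of "confidence-enhanced" attention is to bias the attention logits by per-frame landmark confidence; the sketch below shows that idea in a single head, with no claim that it matches 1DFormer's exact formulation.

```python
import torch
import torch.nn.functional as F

def confidence_attention(q, k, v, conf):
    """Single-head attention biased by per-frame landmark confidence: low-confidence
    frames contribute less to the mixed representation. A plausible reading of
    'confidence-enhanced' attention, not necessarily 1DFormer's formulation.
    q, k, v: (B, T, D); conf: (B, T) confidences in (0, 1]."""
    logits = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5       # (B, T, T)
    logits = logits + conf.clamp_min(1e-6).log().unsqueeze(1)  # down-weight shaky keys
    return F.softmax(logits, dim=-1) @ v
```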

NeurIPS Conference 2024 Conference Paper

FreqMark: Invisible Image Watermarking via Frequency Based Optimization in Latent Space

  • Yiyang Guo
  • Ruizhe Li
  • Mude Hui
  • Hanzhong Guo
  • Chen Zhang
  • Chuangjian Cai
  • Le Wan
  • Shangfei Wang

Invisible watermarking is essential for safeguarding digital content, enabling copyright protection and content authentication. However, existing watermarking methods fall short in robustness against regeneration attacks. In this paper, we propose a novel method called FreqMark that involves unconstrained optimization of the image latent frequency space obtained after VAE encoding. Specifically, FreqMark embeds the watermark by optimizing the latent frequency space of the images and then extracts the watermark through a pre-trained image encoder. This optimization allows a flexible trade-off between image quality and watermark robustness, and effectively resists regeneration attacks. Experimental results demonstrate that FreqMark offers significant advantages in image quality and robustness, permits flexible selection of the encoding bit number, and achieves a bit accuracy exceeding 90% when encoding a 48-bit hidden message under various attack scenarios.
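
A minimal sketch of latent frequency-space optimization, assuming a VAE latent and a pretrained encoder that maps the (decoded) content to bit logits; the optimizer, regularizer, perturbation parameterization, and hyperparameters are guesses, not FreqMark's code.

```python
import torch
import torch.nn.functional as F

def embed_watermark(latent, bit_encoder, target_bits, steps=100, lr=0.05, lam=0.1):
    """Optimize a perturbation of the latent's frequency spectrum until
    bit_encoder recovers target_bits. Sketch only, under assumed interfaces."""
    freq = torch.fft.fft2(latent)                          # latent: (C, H, W) real
    delta = torch.zeros_like(latent, requires_grad=True)   # real-part perturbation
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        marked = torch.fft.ifft2(freq + delta).real        # back to latent space
        logits = bit_encoder(marked)                       # (num_bits,) bit logits
        loss = F.binary_cross_entropy_with_logits(logits, target_bits.float())
        loss = loss + lam * delta.pow(2).mean()            # keep the distortion small
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.fft.ifft2(freq + delta).real.detach()
```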

IJCAI Conference 2021 Conference Paper

Micro-Expression Recognition Enhanced by Macro-Expression from Spatial-Temporal Domain

  • Bin Xia
  • Shangfei Wang

Facial micro-expression recognition has attracted much attention because micro-expressions objectively reveal a person's true emotions. However, the limited size of existing micro-expression datasets poses a great challenge to training a high-performance micro-expression classifier. Since micro-expressions and macro-expressions share some similarities in both spatial and temporal facial behavior patterns, we propose a macro-to-micro transformation framework for micro-expression recognition. Specifically, we first pretrain two-stream baseline models, named MiNet and MaNet, on micro-expression data and macro-expression data respectively. Then, we introduce two auxiliary tasks to align the spatial and temporal features learned from micro-expression data and macro-expression data. In the spatial domain, we introduce a domain discriminator to align the features of MiNet and MaNet. In the temporal domain, we introduce a relation classifier to predict the correct relation for temporal features from MaNet and MiNet. Finally, we propose a contrastive loss that encourages MiNet to give closely aligned features to all entries from the same class in each instance. Experiments on three benchmark databases demonstrate the superiority of the proposed method.
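
Domain-discriminator alignment is commonly implemented with a gradient reversal layer; the sketch below shows that generic pattern for the spatial alignment step. The paper's actual adversarial training setup may differ, and all interfaces here are assumed.

```python
import torch
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, negated gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad):
        return -grad

def spatial_alignment_loss(feat_micro, feat_macro, discriminator):
    """Adversarially align MiNet/MaNet features through a domain discriminator:
    the discriminator learns to tell the domains apart while reversed gradients
    push the feature extractors toward domain-invariant representations."""
    feats = torch.cat([GradReverse.apply(feat_micro), GradReverse.apply(feat_macro)])
    domains = torch.cat([torch.zeros(len(feat_micro)), torch.ones(len(feat_macro))])
    logits = discriminator(feats).squeeze(-1)        # (2B,) domain logits
    return F.binary_cross_entropy_with_logits(logits, domains)
```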

IJCAI Conference 2019 Conference Paper

Capturing Spatial and Temporal Patterns for Facial Landmark Tracking through Adversarial Learning

  • Shi Yin
  • Shangfei Wang
  • Guozhu Peng
  • Xiaoping Chen
  • Bowen Pan

The spatial and temporal patterns inherent in facial feature points are crucial for facial landmark tracking, but have not been thoroughly explored yet. In this paper, we propose a novel deep adversarial framework to explore the shape and temporal dependencies from both the appearance level and the target label level. The proposed deep adversarial framework consists of a deep landmark tracker and a discriminator. The deep landmark tracker is composed of a stacked Hourglass network as well as a convolutional neural network and a long short-term memory network, and thus implicitly captures spatial and temporal patterns from facial appearance for facial landmark tracking. The discriminator is adopted to distinguish the tracked facial landmarks from ground-truth ones. It explicitly models shape and temporal dependencies existing in ground-truth facial landmarks through another convolutional neural network and another long short-term memory network. The deep landmark tracker and the discriminator compete with each other. Through adversarial learning, the proposed deep adversarial landmark tracking approach leverages inherent spatial and temporal patterns to facilitate facial landmark tracking from both the appearance level and the target label level. Experimental results on two benchmark databases demonstrate the superiority of the proposed approach over state-of-the-art work.
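
The tracker-versus-discriminator setup follows the usual GAN recipe over landmark sequences; a generic sketch of the two losses follows, where the sequence shapes, regression term, and adversarial weight are assumptions rather than the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def tracking_losses(pred_seq, gt_seq, discriminator, adv_weight=0.01):
    """GAN-style objective over landmark sequences (B, T, 2K): the discriminator
    scores sequence realism; the tracker combines coordinate regression with an
    adversarial term that rewards realistic shape/temporal structure."""
    real = discriminator(gt_seq)
    fake = discriminator(pred_seq.detach())          # don't update the tracker here
    d_loss = F.binary_cross_entropy_with_logits(real, torch.ones_like(real)) \
           + F.binary_cross_entropy_with_logits(fake, torch.zeros_like(fake))
    adv = discriminator(pred_seq)                    # tracker tries to fool D
    g_loss = F.mse_loss(pred_seq, gt_seq) \
           + adv_weight * F.binary_cross_entropy_with_logits(adv, torch.ones_like(adv))
    return d_loss, g_loss
```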

AAAI Conference 2019 Conference Paper

Dual Semi-Supervised Learning for Facial Action Unit Recognition

  • Guozhu Peng
  • Shangfei Wang

Current works on facial action unit (AU) recognition typically require fully AU-labeled training samples. To reduce the reliance on time-consuming manual AU annotations, we propose a novel semi-supervised AU recognition method leveraging two kinds of readily available auxiliary information. The first is the dependencies between AUs and expressions, as well as the dependencies among AUs; these are caused by facial anatomy and are therefore embedded in all facial images, independent of their AU annotation status. The second is facial image synthesis given AUs, the dual task of AU recognition from facial images, which therefore has intrinsic probabilistic connections with AU recognition, regardless of AU annotations. Specifically, we propose a dual semi-supervised generative adversarial network for AU recognition from partially AU-labeled and fully expression-labeled facial images. The proposed network consists of an AU classifier C, an image generator G, and a discriminator D. In addition to minimizing the supervised losses of the AU classifier and the face generator for labeled training data, we explore the probabilistic duality between the tasks using adversarial learning, forcing the distribution of face-AU-expression tuples generated by the AU classifier and the face generator to converge to the ground-truth distribution observed in labeled data, over all training data. This joint distribution also captures the inherent AU dependencies. Furthermore, we reconstruct the facial image using the output of the AU classifier as the input of the face generator, and create AU labels by feeding the output of the face generator to the AU classifier. We minimize these reconstruction losses for all training data, thus exploiting the informative feedback provided by the dual tasks. Within-database and cross-database experiments on three benchmark databases demonstrate the superiority of our method in both AU recognition and face synthesis compared to state-of-the-art works.
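
The two reconstruction terms described above map naturally onto a pair of cycle losses; a compact sketch follows, where C returning AU probabilities and G taking an AU vector are interface assumptions, not the paper's code.

```python
import torch.nn.functional as F

def dual_reconstruction_losses(images, au_labels, C, G):
    """The two feedback terms from the abstract, usable on unlabeled images too.
    Assumed interfaces: C maps images to AU probabilities in [0, 1]; G maps an
    AU vector to an image. au_labels: float tensor of 0/1 AU activations."""
    # Image cycle: recognize AUs from a face, then regenerate the face from them.
    au_pred = C(images)
    img_cycle = F.l1_loss(G(au_pred), images)
    # Label cycle: synthesize a face from AU labels, then recognize them back.
    au_rec = C(G(au_labels))
    label_cycle = F.binary_cross_entropy(au_rec, au_labels)
    return img_cycle + label_cycle
```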

AAAI Conference 2019 Conference Paper

Image Aesthetic Assessment Assisted by Attributes through Adversarial Learning

  • Bowen Pan
  • Shangfei Wang
  • Qisheng Jiang

The inherent connections among aesthetic attributes and aesthetics are crucial for image aesthetic assessment, but have not been thoroughly explored yet. In this paper, we propose a novel image aesthetic assessment method assisted by attributes at both the representation level and the label level. The attributes are used as privileged information, which is only required during training. Specifically, we first propose a multi-task deep convolutional rating network to learn the aesthetic score and attributes simultaneously. The attributes are exploited to construct better feature representations for aesthetic assessment through multi-task learning. After that, we introduce a discriminator to distinguish the predicted attributes and aesthetics of the multi-task deep network from the ground-truth label distribution embedded in the training data. The multi-task deep network aims to output aesthetic scores and attributes as close to the ground-truth labels as possible, so the deep network and the discriminator compete with each other. Through adversarial learning, the attributes are exploited to force the distribution of the predicted attributes and aesthetics to converge to the ground-truth label distribution. Experimental results on two benchmark databases demonstrate the superiority of the proposed method over state-of-the-art work.
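
The representation-level part reduces to a standard multi-task objective with attributes as privileged (train-only) supervision; a minimal sketch follows, omitting the adversarial discriminator, with all module names as placeholders.

```python
import torch.nn.functional as F

def privileged_multitask_loss(backbone, score_head, attr_head, images, scores, attrs):
    """Shared representation with two heads: the aesthetic score is the target
    task; the attributes are privileged supervision used only during training."""
    feats = backbone(images)
    score_loss = F.mse_loss(score_head(feats).squeeze(-1), scores)
    attr_loss = F.mse_loss(attr_head(feats), attrs)  # attrs unavailable at test time
    return score_loss + attr_loss
```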

AAAI Conference 2017 Conference Paper

Capturing Dependencies among Labels and Features for Multiple Emotion Tagging of Multimedia Data

  • Shan Wu
  • Shangfei Wang
  • Qiang Ji

In this paper, we tackle the problem of emotion tagging of multimedia data by modeling the dependencies among multiple emotions in both the feature and label spaces. These dependencies, which carry crucial top-down and bottom-up evidence for improving multimedia affective content analysis, have not been thoroughly exploited yet. To this end, we propose two hierarchical models that independently and dependently learn the shared features and global semantic relationships among emotion labels to jointly tag multiple emotion labels of multimedia data. Efficient learning and inference algorithms for the proposed models are also developed. Experiments on three benchmark emotion databases demonstrate the superior performance of our methods compared to existing methods.
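
The abstract leaves the models unspecified; purely as a toy illustration of injecting label-space dependencies, one can regularize predicted emotion co-occurrence toward the empirical co-occurrence, a much simpler device than the paper's hierarchical models.

```python
import torch.nn.functional as F

def emotion_tagging_loss(probs, labels):
    """Multi-label tagging loss plus a term matching predicted emotion
    co-occurrence to the batch's empirical co-occurrence (toy stand-in).
    probs, labels: (N, L) predicted probabilities and float 0/1 tags."""
    pred_cooc = probs.t() @ probs / probs.size(0)    # (L, L) predicted co-occurrence
    true_cooc = labels.t() @ labels / labels.size(0)
    return F.binary_cross_entropy(probs, labels) + F.mse_loss(pred_cooc, true_cooc)
```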

AAAI Conference 2017 Conference Paper

Differentiating Between Posed and Spontaneous Expressions with Latent Regression Bayesian Network

  • Quan Gan
  • Siqi Nie
  • Shangfei Wang
  • Qiang Ji

Spatial patterns embedded in human faces are crucial for differentiating posed expressions from spontaneous ones, yet they have not been thoroughly exploited in the literature. To tackle this problem, we present a generative model, i.e., the Latent Regression Bayesian Network (LRBN), to effectively capture the spatial patterns embedded in facial landmark points to differentiate between posed and spontaneous facial expressions. The LRBN is a directed graphical model consisting of one latent layer and one visible layer. Due to the “explaining away” effect in Bayesian networks, the LRBN is able to capture both the dependencies among the latent variables given the observation and the dependencies among the visible variables. We believe that such dependencies are crucial for faithful data representation. Specifically, during training, we construct two LRBNs to capture the spatial patterns inherent in displacements of landmark points from spontaneous facial expressions and posed facial expressions, respectively. During testing, samples are classified as posed or spontaneous expressions according to their likelihoods under the two models. Efficient learning and inference algorithms are proposed. Experimental results on two benchmark databases demonstrate the advantages of the proposed approach in modeling spatial patterns, as well as its superior performance over existing methods in differentiating between posed and spontaneous expressions.
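
The decision rule itself is a two-model likelihood comparison; the runnable toy below uses Gaussian mixtures as stand-ins for the two LRBNs (which have no off-the-shelf implementation) on synthetic displacement features.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
posed = rng.normal(0.0, 1.0, size=(200, 10))         # toy landmark displacements
spontaneous = rng.normal(0.5, 1.2, size=(200, 10))

# One generative model per class, as in the abstract; a GMM stands in for the LRBN.
m_posed = GaussianMixture(n_components=3, random_state=0).fit(posed)
m_spont = GaussianMixture(n_components=3, random_state=0).fit(spontaneous)

def classify(x):
    """Label a displacement vector by comparing per-model log-likelihoods."""
    x = np.atleast_2d(x)
    return "posed" if m_posed.score(x) > m_spont.score(x) else "spontaneous"

print(classify(spontaneous[0]))
```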

AAMAS Conference 2011 Conference Paper

Towards Robot Incremental Learning Constraints from Comparative Demonstration

  • Rong Zhang
  • Shangfei Wang
  • Xiaoping Chen
  • Dong Yin
  • Shijia Chen
  • Min Cheng
  • Yanpeng Lv
  • Jianmin Ji

This paper presents an attempt at incremental robot learning from demonstration. Based on previously learnt knowledge about a task in simpler situations, a robot learns to fulfill the same task properly in a more complicated situation by analyzing comparative demonstrations and extracting new knowledge, especially the constraints that the task in the new situation imposes on the robot's behaviors.
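
The abstract gives no algorithmic detail; purely as a schematic of "constraints from comparative demonstration", the toy below flags workspace cells used in the simple-situation demo but avoided in the comparative one as candidate constraints. The grid representation and thresholds are entirely invented.

```python
import numpy as np

def candidate_constraints(simple_demo, comparative_demo, grid=20):
    """Toy constraint extraction: discretize the workspace and flag cells the
    robot visited in the simple situation but avoided in the comparative one.
    Trajectories are (N, 2) arrays of planar positions."""
    lo = np.minimum(simple_demo.min(0), comparative_demo.min(0))
    hi = np.maximum(simple_demo.max(0), comparative_demo.max(0)) + 1e-9
    def visited(traj):
        cells = ((traj - lo) / (hi - lo) * grid).astype(int)
        return {tuple(c) for c in cells}
    return visited(simple_demo) - visited(comparative_demo)
```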