Arrow Research search

Author name cluster

Shouhong Ding

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

42 papers
2 author rows

Possible papers (42)

AAAI Conference 2026 Conference Paper

D²Pruner: Debiased Importance and Structural Diversity for MLLM Token Pruning

  • Evelyn Zhang
  • Fufu Yu
  • Aoqi Wu
  • Zichen Wen
  • Ke Yan
  • Shouhong Ding
  • Biqing Qi
  • Linfeng Zhang

Processing long visual token sequences poses a significant computational burden on Multimodal Large Language Models (MLLMs). While token pruning offers a path to acceleration, we find that current methods, while adequate for general understanding, catastrophically fail on fine-grained localization tasks. We attribute this failure to the inherent flaws of the two prevailing strategies: importance-based methods suffer from a strong positional bias, an inherent model artifact that distracts from semantic content, while diversity-based methods exhibit structural blindness, disregarding the user's prompt and spatial redundancy. To address this, we introduce D²Pruner, a framework that rectifies these issues by uniquely combining debiased importance with a structural pruning mechanism. Our method first secures a core set of the most critical tokens as pivots based on a debiased attention score. It then performs a Maximal Independent Set (MIS) selection on the remaining tokens, which are modeled on a hybrid graph where edges signify spatial proximity and semantic similarity. This process iteratively preserves the most important available token while removing its neighbors, ensuring that the supplementary tokens are chosen to maximize importance and diversity. Extensive experiments demonstrate that D²Pruner achieves exceptional efficiency and fidelity.
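For intuition, the MIS-style selection described in the abstract can be pictured as a greedy pass over a token graph. The sketch below is a hypothetical reconstruction, not the authors' code: the thresholds, the importance scores, and the hybrid-graph construction are illustrative assumptions.

```python
import numpy as np

def greedy_mis_prune(tokens, scores, budget, sim_thresh=0.8,
                     coords=None, dist_thresh=1.5):
    """Greedy MIS-style selection: repeatedly keep the highest-scoring
    surviving token and delete its neighbors on the hybrid graph.

    tokens: (N, D) token features; scores: (N,) importance values
    (e.g., debiased attention); coords: optional (N, 2) grid positions.
    """
    feats = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-12)
    adj = feats @ feats.T > sim_thresh                 # semantic-similarity edges
    if coords is not None:                             # spatial-proximity edges
        dist = np.linalg.norm(coords[:, None] - coords[None], axis=-1)
        adj |= dist < dist_thresh
    np.fill_diagonal(adj, False)

    alive = np.ones(len(tokens), dtype=bool)
    kept = []
    for i in np.argsort(-scores):                      # most important first
        if not alive[i]:
            continue
        kept.append(i)
        alive[i] = False
        alive[adj[i]] = False                          # drop its graph neighbors
        if len(kept) == budget:
            break
    return np.array(kept)
```

In the paper's pipeline, the pivot tokens chosen by debiased importance would presumably be fixed first, with a pass like this one filling only the remaining budget.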

AAAI Conference 2026 Conference Paper

GloTok: Global Perspective Tokenizer for Image Reconstruction and Generation

  • Xuan Zhao
  • Zhongyu Zhang
  • Yuge Huang
  • Yuxi Mi
  • Guodong Mu
  • Shouhong Ding
  • Jun Wang
  • Rizen Guo

Existing state-of-the-art image tokenization methods leverage diverse semantic features from pre-trained vision models for additional supervision, to expand the distribution of latent representations and thereby improve the quality of image reconstruction and generation. These methods employ a locally supervised approach for semantic supervision, which limits the uniformity of semantic distribution. However, VA-VAE proves that a more uniform feature distribution yields better generation performance. In this work, we introduce a Global Perspective Tokenizer (GloTok), which utilizes global relational information to model a more uniform semantic distribution of tokenized features. Specifically, a codebook-wise histogram relation learning method is proposed to transfer the semantics, which are modeled by pre-trained models on the entire dataset, to the semantic codebook. Then, we design a residual learning module which recovers the fine-grained details to minimize the reconstruction error caused by quantization. Through the above design, GloTok delivers more uniformly distributed semantic latent representations, which facilitates the training of autoregressive (AR) models for generating high-quality images without requiring direct access to pre-trained models during the training process. Experiments on the standard ImageNet-1k benchmark clearly show that our proposed method achieves state-of-the-art reconstruction performance and generation quality.

AAAI Conference 2026 Conference Paper

LSAP-PV: High-Fidelity Palm Vein Image Synthesis via Layered Spectral Absorption Projection-Guided Diffusion Model

  • Sheng Shang
  • Chenglong Zhao
  • Ruixin Zhang
  • Jianlong Jin
  • Jingyun Zhang
  • Jun Wang
  • Yang Zhao
  • Shouhong Ding

Palm vein recognition has emerged as a promising biometric technology, yet its development remains constrained by the scarcity of large-scale publicly available datasets. Several methods of palm vein image generation have been proposed to address this issue. These methods usually focus on the anatomical realism of palm vein patterns, but overlook the biophysical correlation between identities and vein patterns, particularly in simulating identity-specific vein contrast. To tackle this limitation, we propose a novel biophysics-driven synthesis method. Our method constructs a 3D palm vascular tree via an established modeling method. Then, a projection model is proposed to map the 3D tree into 2D space to derive palm vein patterns. The projection model is based on skin spectral absorption and simulates the natural attenuation of light passing through the skin using a layer integration method. For different identities, we sample different skin parameters, resulting in varying degrees of attenuation. This method effectively simulates the variation in vein contrast across different identities. Furthermore, we introduce a conditional diffusion model that uses the projected patterns as identity conditions to generate palm vein images. To the best of our knowledge, this is the first palm vein generation method based on the diffusion model. Experimental results demonstrate that our method not only outperforms existing methods, but also enables a recognition model trained on our synthetic data to achieve superior performance compared to a model trained on real-world data at a scale of 2,000 IDs, in terms of TAR@FAR=1e-4 under the 1:1 open-set protocol.
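The layered spectral-absorption projection is essentially a Beer-Lambert-style attenuation integrated over skin layers. A minimal sketch, with made-up coefficients (mu_layers, thickness, and mu_blood are placeholders, not values from the paper):

```python
import numpy as np

def project_vein_pattern(path_len, mu_layers, thickness, mu_blood=1.8):
    """Toy layered-absorption projection of a vessel onto the skin surface.

    path_len:  (H, W) length of vessel traversed by each ray (mm, 0 = no vein).
    mu_layers: absorption coefficients of the skin layers above the vein (1/mm).
    thickness: matching layer thicknesses (mm). All coefficients are placeholders.
    """
    # Optical depth of the skin stack (layer integration, Beer-Lambert style).
    skin_od = sum(m * t for m, t in zip(mu_layers, thickness))
    # Extra attenuation where the ray also passes through venous blood.
    transmitted = np.exp(-(skin_od + mu_blood * path_len))
    return transmitted / transmitted.max()  # normalized 2D vein contrast pattern
```

Sampling mu_layers and thickness per identity would then vary the vein contrast across identities, as the abstract describes.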

AAAI Conference 2026 Conference Paper

TripleFDS: Triple Feature Disentanglement and Synthesis for Scene Text Editing

  • Yuchen Bao
  • Yiting Wang
  • Wenjian Huang
  • Haowei Wang
  • Shen Chen
  • Taiping Yao
  • Shouhong Ding
  • Jianguo Zhang

Scene Text Editing (STE) aims to naturally modify text in images while preserving visual consistency, the decisive factors of which can be divided into three parts, i.e., text style, text content, and background. Previous methods have struggled with incomplete disentanglement of editable attributes, typically addressing only one aspect—such as editing text content—thus limiting controllability and visual consistency. To overcome these limitations, we propose TripleFDS, a novel framework for STE with disentangled modular attributes, and an accompanying dataset called SCB Synthesis. SCB Synthesis provides robust training data for triple feature disentanglement by utilizing the "SCB Group", a novel construct that combines three attributes per image to generate diverse, disentangled training groups. Leveraging this construct as a basic training unit, TripleFDS first disentangles triple features, ensuring semantic accuracy through inter-group contrastive regularization and preventing redundancy through intra-sample multi-feature orthogonality. In the synthesis phase, TripleFDS performs feature remapping to prevent "shortcut" phenomena during reconstruction and mitigate potential feature leakage. Trained on 125,000 SCB Groups, TripleFDS achieves state-of-the-art image fidelity (SSIM of 44.54) and text accuracy (ACC of 93.58%) on the mainstream STE benchmarks. Besides superior performance, the more flexible editing of TripleFDS supports new operations such as style replacement and background transfer.

NeurIPS Conference 2025 Conference Paper

Dual Data Alignment Makes AI-Generated Image Detector Easier Generalizable

  • Ruoxin Chen
  • Junwei Xi
  • Zhiyuan Yan
  • Ke-Yue Zhang
  • Shuang Wu
  • Jingyi Xie
  • Xu Chen
  • Lei Xu

The rapid increase in AI-generated images (AIGIs) underscores the need for detection methods. Existing detectors are often trained on biased datasets, leading to overfitting on spurious correlations between non-causal image attributes and real/synthetic labels. While these biased features enhance performance on the training data, they result in substantial performance degradation when tested on unbiased datasets. A common solution is to perform data alignment through generative reconstruction, matching the content between real and synthetic images. However, we find that pixel-level alignment alone is inadequate, as the reconstructed images still suffer from frequency-level misalignment, perpetuating spurious correlations. To illustrate, we observe that reconstruction models restore the high-frequency details lost in real images, inadvertently creating a frequency-level misalignment, where synthetic images appear to have richer high-frequency content than real ones. This misalignment leads to models associating high-frequency features with synthetic labels, further reinforcing biased cues. To resolve this, we propose Dual Data Alignment (DDA), which aligns both the pixel and frequency domains. DDA generates synthetic images that closely resemble real ones by fusing real and synthetic image pairs in both domains, enhancing the detector's ability to identify forgeries without relying on biased features. Moreover, we introduce two new test sets: DDA-COCO, containing DDA-aligned synthetic images, and EvalGEN, featuring the latest generative models. Our extensive evaluations demonstrate that a detector trained exclusively on DDA-aligned MSCOCO improves across diverse benchmarks. Code is available at https://github.com/roy-ch/Dual-Data-Alignment.
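A rough sketch of the frequency-domain half of such an alignment is given below. The radial cutoff and the grayscale simplification are assumptions; DDA's actual fusion operates on paired real/synthetic images in both the pixel and frequency domains.

```python
import numpy as np

def fuse_frequencies(real, synthetic, cutoff=0.1):
    """Swap frequency bands between a paired real image and its synthetic
    counterpart: low frequencies from the synthetic image, high frequencies
    from the real one, so high-frequency statistics follow the real data.

    real, synthetic: (H, W) grayscale float arrays of identical shape.
    """
    h, w = real.shape
    f_real = np.fft.fftshift(np.fft.fft2(real))
    f_syn = np.fft.fftshift(np.fft.fft2(synthetic))
    yy, xx = np.ogrid[:h, :w]
    radius = np.hypot(yy - h / 2, xx - w / 2)
    low_pass = radius <= cutoff * min(h, w)        # circular low-frequency mask
    fused = np.where(low_pass, f_syn, f_real)      # band-wise composition
    return np.real(np.fft.ifft2(np.fft.ifftshift(fused)))
```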

IJCAI Conference 2025 Conference Paper

EyeSeg: An Uncertainty-Aware Eye Segmentation Framework for AR/VR

  • Zhengyuan Peng
  • Jianqing Xu
  • Shen Li
  • Jiazhen Ji
  • Yuge Huang
  • Jingyun Zhang
  • Jinmin Li
  • Shouhong Ding

Human-machine interaction through augmented reality (AR) and virtual reality (VR) is increasingly prevalent, requiring accurate and efficient gaze estimation, which hinges on the accuracy of eye segmentation to enable smooth user experiences. We introduce EyeSeg, a novel eye segmentation framework designed to overcome key challenges that existing approaches struggle with: motion blur, eyelid occlusion, and train-test domain gaps. In these situations, existing models struggle to extract robust features, leading to suboptimal performance. Noting that these challenges can be generally quantified by uncertainty, we design EyeSeg as an uncertainty-aware eye segmentation framework for AR/VR wherein we explicitly model the uncertainties by performing Bayesian uncertainty learning of a posterior under the closed set prior. Theoretically, we prove that a statistic of the learned posterior indicates segmentation uncertainty levels; empirically, it outperforms existing methods in downstream tasks such as gaze estimation. EyeSeg outputs an uncertainty score alongside the segmentation result, weighting and fusing multiple gaze estimates for robustness, which proves effective especially under motion blur, eyelid occlusion, and cross-domain challenges. Moreover, empirical results show that EyeSeg achieves improvements in MIoU, E1, F1, and ACC, surpassing previous approaches.
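The "weighting and fusing multiple gaze estimates" step can be pictured as confidence-weighted averaging. The exponential weighting below is an illustrative choice, not the paper's exact rule:

```python
import numpy as np

def fuse_gaze_estimates(gazes, uncertainties, temperature=1.0):
    """Fuse candidate gaze vectors, down-weighting uncertain estimates.

    gazes: (K, 3) candidate gaze direction vectors.
    uncertainties: (K,) per-estimate uncertainty scores (higher = worse).
    The softmax-of-negative-uncertainty weighting is an assumption.
    """
    w = np.exp(-np.asarray(uncertainties) / temperature)
    w /= w.sum()
    fused = (w[:, None] * np.asarray(gazes)).sum(axis=0)
    return fused / np.linalg.norm(fused)   # re-normalize the direction
```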

NeurIPS Conference 2025 Conference Paper

Guard Me If You Know Me: Protecting Specific Face-Identity from Deepfakes

  • Kaiqing Lin
  • Zhiyuan Yan
  • Ke-Yue Zhang
  • Li Hao
  • Yue Zhou
  • Yuzhen Lin
  • Weixiang Li
  • Taiping Yao

Securing personal identity against deepfake attacks is increasingly critical in the digital age, especially for celebrities and political figures whose faces are easily accessible and frequently targeted. Most existing deepfake detection methods focus on general-purpose scenarios and often ignore the valuable prior knowledge of known facial identities, e.g., "VIP individuals" whose authentic facial data are already available. In this paper, we propose VIPGuard, a unified multimodal framework designed to capture fine-grained and comprehensive facial representations of a given identity, compare them against potentially fake or similar-looking faces, and reason over these comparisons to make accurate and explainable predictions. Specifically, our framework consists of three main stages. First, we fine-tune a multimodal large language model (MLLM) to learn detailed and structural facial attributes. Second, we perform identity-level discriminative learning to enable the model to distinguish subtle differences between highly similar faces, including real and fake variations. Finally, we introduce user-specific customization, where we model the unique characteristics of the target face identity and perform semantic reasoning via the MLLM to enable personalized and explainable deepfake detection. Our framework shows clear advantages over previous detection works, where traditional detectors mainly rely on low-level visual cues and provide no human-understandable explanations, while other MLLM-based models often lack a detailed understanding of specific face identities. To facilitate the evaluation of our method, we build a comprehensive identity-aware benchmark called VIPBench for personalized deepfake detection, involving 7 of the latest face-swapping and 7 entire-face-synthesis techniques for generation. Extensive experiments show that our model outperforms existing methods in both detection and explanation. The code is available at https://github.com/KQL11/VIPGuard.

AAAI Conference 2025 Conference Paper

Instruct Where the Model Fails: Generative Data Augmentation via Guided Self-contrastive Fine-tuning

  • Weijian Ma
  • Ruoxin Chen
  • Keyue Zhang
  • Shuang Wu
  • Shouhong Ding

Data augmentation is expected to bring about unseen features of the training set, enhancing the model's ability to generalize in situations where data is limited. Generative image models trained on large web-crawled datasets such as LAION are known to produce images with stereotypes and imperceptible bias when used to augment training data, owing to dataset misalignment and the generator's ignorance of the downstream model. We improve downstream-task awareness in generated images by proposing a task-aware fine-tuning strategy that actively detects failures of the downstream task in the target model to fine-tune the generation process between epochs. The dynamic fine-tuning strategy works by (1) inspecting misalignment between generated data and original data via VLM captioners and (2) adjusting both the prompts and the diffusion model, so that the strategy dynamically guides the generator by focusing on the bias detected by the VLM. This is done by re-captioning the overfitted data as well as fine-tuning the diffusion trajectory in a contrastive manner. To cooperate with the VLM captioner, the contrastive fine-tuning process dynamically adjusts different parts of the diffusion trajectory based on the detected misalignment, thus shifting the generated distribution away from making the downstream model overfit. Our experiments on few-shot class-incremental learning show that our instruction-guided fine-tuning strategy consistently assists the downstream model with higher classification accuracy compared to generative data augmentation baselines such as Stable Diffusion and GPT-4o, and state-of-the-art non-generative strategies.

ICML Conference 2025 Conference Paper

Large Continual Instruction Assistant

  • Jingyang Qiao
  • Zhizhong Zhang 0001
  • Xin Tan 0002
  • Yanyun Qu
  • Shouhong Ding
  • Yuan Xie 0006

Continual Instruction Tuning (CIT) is adopted to continually instruct Large Models to follow human intent, dataset by dataset. It is observed that existing gradient updates heavily degrade performance on previous datasets during the CIT process. By contrast, Exponential Moving Average (EMA) can trace previous parameters, which helps decrease forgetting. Nonetheless, its fixed balance weight fails to cope with ever-changing datasets, upsetting the balance between plasticity and stability. In this paper, we propose a general continual instruction tuning framework to address this challenge. Starting from the trade-off prerequisite and the EMA update, we derive an ideal condition for plasticity and stability. Based on a Taylor expansion of the loss function, we find that the optimal balance weight can be automatically determined by the gradients and learned parameters. Therefore, we propose a stable-plasticity balanced coefficient to avoid knowledge interference. Based on the semantic similarity of the instructions, we can determine whether to retrain or expand the training parameters and allocate the most suitable parameters to the testing instances. Extensive experiments across multiple continual instruction tuning benchmarks demonstrate that our approach not only enhances anti-forgetting capabilities but also significantly improves overall continual tuning performance. Our code is available at https://github.com/JingyangQiao/CoIN.
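The core idea, an EMA whose balance weight adapts per step rather than staying fixed, might be sketched as follows. The closed form for beta below (update size versus parameter drift) is a simplified stand-in for the paper's Taylor-derived coefficient, shown only to make the mechanism concrete.

```python
import torch

@torch.no_grad()
def adaptive_ema_update(ema_params, params, grads, lr, eps=1e-8):
    """EMA of model parameters with a per-step balance weight.

    ema_params/params/grads: matching lists of tensors; lr: learning rate.
    The formula for beta is an assumption, not the paper's exact rule.
    """
    for e, p, g in zip(ema_params, params, grads):
        drift = (p - e).norm() + eps          # how far the EMA lags the model
        step = (lr * g).norm() + eps          # magnitude of the current update
        beta = drift / (drift + step)         # bigger updates -> smaller beta
        e.mul_(beta).add_(p, alpha=1.0 - beta)
```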

ICML Conference 2025 Conference Paper

Orthogonal Subspace Decomposition for Generalizable AI-Generated Image Detection

  • Zhiyuan Yan 0002
  • Jiangming Wang
  • Peng Jin 0001
  • Ke-Yue Zhang
  • Chengchun Liu
  • Shen Chen 0004
  • Taiping Yao
  • Shouhong Ding

Detecting AI-generated images (AIGIs), such as natural images or face images, has become increasingly important yet challenging. In this paper, we start from a new perspective to excavate the reason behind the failed generalization in AIGI detection, named the asymmetry phenomenon: a naively trained detector tends to overfit to the limited and monotonous fake patterns, causing the feature space to become highly constrained and low-ranked, which is shown to seriously limit expressivity and generalization. One potential remedy is incorporating the pre-trained knowledge within vision foundation models (higher-ranked) to expand the feature space, alleviating the model's overfitting to fake patterns. To this end, we employ Singular Value Decomposition (SVD) to decompose the original feature space into two orthogonal subspaces. By freezing the principal components and adapting only the remaining components, we preserve the pre-trained knowledge while learning fake patterns. Compared to existing full-parameter and LoRA-based tuning methods, we explicitly ensure orthogonality, enabling a higher rank of the whole feature space, effectively minimizing overfitting and enhancing generalization. We finally identify a crucial insight: our method implicitly learns a vital prior that fakes are actually derived from the real, indicating a hierarchical relationship rather than independence. Modeling this prior, we believe, is essential for achieving superior generalization. Our code is publicly available at https://github.com/YZY-stack/Effort-AIGI-Detection.
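A minimal sketch of the SVD split described above, assuming a linear layer: the top-k singular directions are frozen and only the residual factors are trained. The rank split and the absence of re-orthogonalization during training are simplifications, not the released implementation.

```python
import torch
import torch.nn as nn

class OrthogonalSubspaceLinear(nn.Module):
    """Freeze the principal SVD subspace of a pre-trained weight; train only
    the residual subspace (a sketch of the orthogonal-decomposition idea).
    """

    def __init__(self, weight: torch.Tensor, k: int):
        super().__init__()
        U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
        # Principal subspace: frozen buffer carrying the pre-trained knowledge.
        self.register_buffer("w_main", U[:, :k] @ torch.diag(S[:k]) @ Vh[:k])
        # Residual subspace: trainable factors that learn the fake patterns.
        self.u = nn.Parameter(U[:, k:].clone())
        self.s = nn.Parameter(S[k:].clone())
        self.vh = nn.Parameter(Vh[k:].clone())

    def forward(self, x):
        w = self.w_main + self.u @ torch.diag(self.s) @ self.vh
        return x @ w.T
```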

ICML Conference 2025 Conference Paper

PiD: Generalized AI-Generated Images Detection with Pixelwise Decomposition Residuals

  • Xinghe Fu
  • Zhiyuan Yan 0002
  • Zheng Yang
  • Taiping Yao
  • Yandan Zhao
  • Shouhong Ding
  • Xi Li 0001

Fake images, created by recently advanced generative models, have become increasingly indistinguishable from real ones, making their detection crucial, urgent, and challenging. This paper introduces PiD (Pixelwise Decomposition Residuals), a novel detection method that focuses on residual signals within images. Generative models are designed to optimize high-level semantic content (principal components), often overlooking low-level signals (residual components). PiD leverages this observation by disentangling residual components from images, encouraging the model to uncover more underlying and general forgery clues independent of semantic content. Compared to prior approaches that rely on reconstruction techniques or high-frequency information, PiD is computationally efficient and does not rely on any generative models for reconstruction. Specifically, PiD operates at the pixel level, mapping the pixel vector to another color space (e.g., YUV) and then quantizing the vector. The pixel vector is mapped back to the RGB space and the quantization loss is taken as the residual for AIGC detection. Our experimental results are striking and highly surprising: PiD achieves 98% accuracy on the widely used GenImage benchmark, highlighting its effectiveness and generalization performance.
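The pixelwise round trip is concrete enough to sketch directly. The BT.601 conversion matrix and the 32-level uniform quantizer below are illustrative choices, since the abstract leaves them open:

```python
import numpy as np

# RGB <-> YUV conversion (BT.601 coefficients); an assumed concrete choice.
RGB2YUV = np.array([[0.299, 0.587, 0.114],
                    [-0.14713, -0.28886, 0.436],
                    [0.615, -0.51499, -0.10001]])
YUV2RGB = np.linalg.inv(RGB2YUV)

def pid_residual(img, levels=32):
    """Pixelwise decomposition residual: RGB -> YUV -> quantize -> RGB,
    then keep the round-trip quantization error as the detection input.

    img: (H, W, 3) float array in [0, 1]; levels is a placeholder setting.
    """
    yuv = img @ RGB2YUV.T
    quantized = np.round(yuv * levels) / levels   # uniform quantization in YUV
    roundtrip = quantized @ YUV2RGB.T
    return img - roundtrip                        # low-level residual signal
```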

AAAI Conference 2025 Conference Paper

PVTree: Realistic and Controllable Palm Vein Generation for Recognition Tasks

  • Sheng Shang
  • Chenglong Zhao
  • Ruixin Zhang
  • Jianlong Jin
  • Jingyun Zhang
  • Rizen Guo
  • Shouhong Ding
  • Yunsheng Wu

Palm vein recognition is an emerging biometric technology that offers enhanced security and privacy. However, acquiring sufficient palm vein data for training deep learning-based recognition models is challenging due to the high costs of data collection and privacy protection constraints. This has led to a growing interest in generating pseudo-palm vein data using generative models. Existing methods, however, often produce unrealistic palm vein patterns or struggle with controlling identity and style attributes. To address these issues, we propose a novel palm vein generation framework named PVTree. First, the palm vein identity is defined by a complex and authentic 3D palm vascular tree, created using an improved Constrained Constructive Optimization (CCO) algorithm. Second, palm vein patterns of the same identity are generated by projecting the same 3D vascular tree into 2D images from different views and converting them into realistic images using a generative model. As a result, PVTree satisfies the need for both identity consistency and intra-class diversity. Extensive experiments conducted on several publicly available datasets demonstrate that our proposed palm vein generation method surpasses existing methods and achieves a higher TAR@FAR=1e-4 under the 1:1 Open-set protocol. To the best of our knowledge, this is the first time that the performance of a recognition model trained on synthetic palm vein data exceeds that of the recognition model trained on real data, which indicates that palm vein image generation research has a promising future.

AAAI Conference 2025 Conference Paper

SlerpFace: Face Template Protection via Spherical Linear Interpolation

  • Zhizhou Zhong
  • Yuxi Mi
  • Yuge Huang
  • Jianqing Xu
  • Guodong Mu
  • Shouhong Ding
  • Jingyun Zhang
  • Rizen Guo

Contemporary face recognition systems use feature templates extracted from face images to identify persons. To enhance privacy, face template protection techniques are widely employed to conceal sensitive identity and appearance information stored in the template. This paper identifies an emerging privacy attack form utilizing diffusion models that could nullify prior protection. The attack can synthesize high-quality, identity-preserving face images from templates, revealing persons' appearance. Based on studies of the diffusion model's generative capability, this paper proposes a defense by rotating templates to a noise-like distribution. This is achieved efficiently by spherically and linearly interpolating templates on their located hypersphere. This paper further proposes to divide templates' feature dimensions into groups and drop them out group-wise, to enhance the irreversibility of rotated templates. The proposed techniques are concretized as a novel face template protection technique, SlerpFace. Extensive experiments show that SlerpFace provides satisfactory recognition accuracy and comprehensive protection against inversion and other attack forms, superior to prior arts.
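Spherical linear interpolation toward a random unit vector, followed by group-wise dropout, can be sketched as below. The interpolation weight, the group count, and the keep ratio are placeholders, not the paper's settings.

```python
import numpy as np

def slerp_protect(template, alpha=0.7, num_groups=4, keep_groups=3, seed=0):
    """Rotate a unit-norm template toward noise via slerp, then zero out
    entire feature-dimension groups (a sketch; parameters are assumptions).

    template: (D,) unit-norm face feature.
    """
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(template.shape)
    noise /= np.linalg.norm(noise)
    omega = np.arccos(np.clip(template @ noise, -1.0, 1.0))
    rotated = (np.sin((1 - alpha) * omega) * template
               + np.sin(alpha * omega) * noise) / np.sin(omega)
    # Group-wise dropout: keep only some dimension groups for irreversibility.
    groups = np.array_split(np.arange(template.size), num_groups)
    mask = np.zeros(template.size)
    for g in rng.choice(num_groups, size=keep_groups, replace=False):
        mask[groups[g]] = 1.0
    return rotated * mask
```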

NeurIPS Conference 2025 Conference Paper

Switchable Token-Specific Codebook Quantization For Face Image Compression

  • Yongbo Wang
  • Haonan Wang
  • Guodong Mu
  • Ruixin Zhang
  • Jiaqi Chen
  • Jingyun Zhang
  • Jun Wang
  • Yuan Xie

With the ever-increasing volume of visual data, efficient and lossless transmission, along with subsequent interpretation and understanding, has become a critical bottleneck in modern information systems. Recently emerged codebook-based solutions utilize a globally shared codebook to quantize and dequantize each token, controlling the bpp by adjusting the number of tokens or the codebook size. However, for facial images, which are rich in attributes, such global codebook strategies overlook both the category-specific correlations within images and the semantic differences among tokens, resulting in suboptimal performance, especially at low bpp. Motivated by these observations, we propose a Switchable Token-Specific Codebook Quantization for face image compression, which learns distinct codebook groups for different image categories and assigns an independent codebook to each token. By recording the codebook group to which each token belongs with a small number of bits, our method can reduce the loss incurred when decreasing the size of each codebook group. This enables a larger total number of codebooks under a lower overall bpp, thereby enhancing the expressive capability and improving reconstruction performance. Owing to its generalizable design, our method can be integrated into any existing codebook-based representation learning approach and has demonstrated its effectiveness on face recognition datasets, achieving an average accuracy of 93.51% for reconstructed images at 0.05 bpp.
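A sketch of token-specific quantization with switchable groups is given below; the tensor shapes and the brute-force nearest-neighbor search are illustrative, not the paper's implementation.

```python
import numpy as np

def quantize_tokens(tokens, codebooks, group_ids):
    """Quantize each token with its own codebook from a selected group.

    tokens:    (T, D) token features.
    codebooks: (G, T, K, D) array: G switchable groups, one K-entry codebook
               per token position (sizes are assumptions).
    group_ids: (T,) chosen group per token; transmitting these indices costs
               only about log2(G) extra bits per token.
    """
    T = tokens.shape[0]
    codes, recon = np.empty(T, dtype=int), np.empty_like(tokens)
    for t in range(T):
        cb = codebooks[group_ids[t], t]                     # (K, D) codebook
        codes[t] = np.argmin(((cb - tokens[t]) ** 2).sum(axis=1))
        recon[t] = cb[codes[t]]                             # dequantized token
    return codes, recon
```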

ICLR Conference 2025 Conference Paper

ToVE: Efficient Vision-Language Learning via Knowledge Transfer from Vision Experts

  • Yuanchen Wu
  • Junlong Du
  • Ke Yan
  • Shouhong Ding
  • Xiaoqiang Li 0002

Vision-language (VL) learning requires extensive visual perception capabilities, such as fine-grained object recognition and spatial perception. Recent works typically rely on training huge models on massive datasets to develop these capabilities. As a more efficient alternative, this paper proposes a new framework that Transfers the knowledge from a hub of Vision Experts (ToVE) for efficient VL learning, leveraging pre-trained vision expert models to promote visual perception capability. Specifically, building on a frozen CLIP image encoder that provides vision tokens for image-conditioned language generation, ToVE introduces a hub of multiple vision experts and a token-aware gating network that dynamically routes expert knowledge to vision tokens. In the transfer phase, we propose a "residual knowledge transfer" strategy, which not only preserves the generalizability of the vision tokens but also allows selective detachment of low-contributing experts to improve inference efficiency. Further, we explore merging this expert knowledge into a single CLIP encoder, creating a knowledge-merged CLIP that produces more informative vision tokens without expert inference during deployment. Experimental results across various VL tasks demonstrate that the proposed ToVE achieves competitive performance with two orders of magnitude less training data.
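The token-aware gating can be pictured as a per-token softmax over experts feeding a residual sum. The linear gate and projections below are assumptions for illustration, not ToVE's exact modules.

```python
import torch
import torch.nn as nn

class TokenAwareGate(nn.Module):
    """Per-token routing of expert knowledge into vision tokens (a sketch)."""

    def __init__(self, dim, num_experts):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)
        self.proj = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])

    def forward(self, tokens, expert_feats):
        # tokens: (B, N, D); expert_feats: list of per-expert (B, N, D) maps.
        weights = self.gate(tokens).softmax(dim=-1)          # (B, N, E)
        mixed = sum(weights[..., e:e + 1] * self.proj[e](feat)
                    for e, feat in enumerate(expert_feats))
        return tokens + mixed          # residual-style knowledge transfer
```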

ICML Conference 2025 Conference Paper

Towards Rationale-Answer Alignment of LVLMs via Self-Rationale Calibration

  • Yuanchen Wu
  • Ke Yan
  • Shouhong Ding
  • Ziyin Zhou
  • Xiaoqiang Li 0002

Large Vision-Language Models (LVLMs) have manifested strong visual question answering capability. However, they still struggle with aligning the rationale and the generated answer, leading to inconsistent reasoning and incorrect responses. To this end, this paper introduces the Self-Rationale Calibration (SRC) framework to iteratively calibrate the alignment between rationales and answers. SRC begins by employing a lightweight “rationale fine-tuning” approach, which modifies the model’s response format to require a rationale before deriving the answer, without explicit prompts. Next, SRC searches a diverse set of candidate responses from the fine-tuned LVLMs for each sample, followed by a proposed pairwise scoring strategy using a tailored scoring model, R-Scorer, to evaluate both the rationale quality and the factual consistency of candidates. Based on a confidence-weighted preference curation process, SRC decouples the alignment calibration into a preference fine-tuning manner, leading to significant improvements of LVLMs in perception, reasoning, and generalization across multiple benchmarks. Our results emphasize the role of rationale-oriented alignment in exploring the potential of LVLMs.

ICLR Conference 2025 Conference Paper

UIFace: Unleashing Inherent Model Capabilities to Enhance Intra-Class Diversity in Synthetic Face Recognition

  • Xiao Lin
  • Yuge Huang
  • Jianqing Xu
  • Yuxi Mi
  • Shuigeng Zhou
  • Shouhong Ding

Face recognition (FR) stands as one of the most crucial applications in computer vision. The accuracy of FR models has significantly improved in recent years due to the availability of large-scale human face datasets. However, directly using these datasets can inevitably lead to privacy and legal problems. Generating synthetic data to train FR models is a feasible solution to circumvent these issues. While existing synthetic-based face recognition methods have made significant progress in generating identity-preserving images, they are severely plagued by context overfitting, resulting in a lack of intra-class diversity of generated images and poor face recognition performance. In this paper, we propose a framework to $\textbf{U}$nleash model $\textbf{I}$nherent capabilities to enhance intra-class diversity for synthetic face recognition, abbreviated as $\textbf{UIFace}$. Our framework first trains a diffusion model that can perform denoising conditioned on either identity contexts or a learnable empty context. The former generates identity-preserving images but lacks variations, while the latter exploits the model's intrinsic ability to synthesize intra-class-diversified images but with random identities. Then we adopt a novel two-stage denoising strategy to fully leverage the strengths of both types of contexts, resulting in images that are diverse as well as identity-preserving. Moreover, an attention injection module is introduced to further augment the intra-class variations by utilizing attention maps from the empty context to guide the denoising process in ID-conditioned generation. Experiments show that our method significantly surpasses previous approaches with even less training data and half the size of the synthetic dataset. More surprisingly, the proposed $\textbf{UIFace}$ even achieves comparable performance to FR models trained on real datasets when we increase the number of synthetic identities.

NeurIPS Conference 2024 Conference Paper

$\text{ID}^3$: Identity-Preserving-yet-Diversified Diffusion Models for Synthetic Face Recognition

  • Jianqing Xu
  • Shen Li
  • Jiaying Wu
  • Miao Xiong
  • Ailin Deng
  • Jiazhen Ji
  • Yuge Huang
  • Guodong Mu

Synthetic face recognition (SFR) aims to generate synthetic face datasets that mimic the distribution of real face data, which allows for training face recognition models in a privacy-preserving manner. Despite the remarkable potential of diffusion models in image generation, current diffusion-based SFR models struggle with generalization to real-world faces. To address this limitation, we outline three key objectives for SFR: (1) promoting diversity across identities (inter-class diversity), (2) ensuring diversity within each identity by injecting various facial attributes (intra-class diversity), and (3) maintaining identity consistency within each identity group (intra-class identity preservation). Inspired by these goals, we introduce a diffusion-fueled SFR model termed $\text{ID}^3$. $\text{ID}^3$ employs an ID-preserving loss to generate diverse yet identity-consistent facial appearances. Theoretically, we show that minimizing this loss is equivalent to maximizing the lower bound of an adjusted conditional log-likelihood over ID-preserving data. This equivalence motivates an ID-preserving sampling algorithm, which operates over an adjusted gradient vector field, enabling the generation of fake face recognition datasets that approximate the distribution of real-world faces. Extensive experiments across five challenging benchmarks validate the advantages of $\text{ID}^3$.

NeurIPS Conference 2024 Conference Paper

DF40: Toward Next-Generation Deepfake Detection

  • Zhiyuan Yan
  • Taiping Yao
  • Shen Chen
  • Yandan Zhao
  • Xinghe Fu
  • Junwei Zhu
  • Donghao Luo
  • Chengjie Wang

We propose a new comprehensive benchmark to revolutionize the current deepfake detection field toward the next generation. Predominantly, existing works identify top-notch detection algorithms and models by adhering to the common practice: training detectors on one specific dataset (e.g., FF++) and testing them on other prevalent deepfake datasets. This protocol is often regarded as a "golden compass" for navigating SoTA detectors. But can these stand-out "winners" be truly applied to tackle the myriad of realistic and diverse deepfakes lurking in the real world? If not, what underlying factors contribute to this gap? In this work, we found the dataset (both train and test) can be the "primary culprit" due to the following: (1) forgery diversity: deepfake techniques commonly refer to both face forgery (face-swapping and face-reenactment) and entire image synthesis (AIGC, especially face). Most existing datasets only contain partial types of them, with limited forgery methods implemented (e.g., 2 swapping and 2 reenactment methods in FF++); (2) forgery realism: the dominant training dataset, FF++, contains out-of-date forgery techniques from the past four years. "Honing skills" on these forgeries makes it difficult to guarantee effective detection generalization toward today's SoTA deepfakes; (3) evaluation protocol: most detection works perform evaluations on one type, e.g., face-swapping only, which hinders the development of universal deepfake detectors. To address this dilemma, we construct a highly diverse and large-scale deepfake detection dataset called DF40, which comprises 40 distinct deepfake techniques (10 times larger than FF++). We then conduct comprehensive evaluations using 4 standard evaluation protocols and 8 representative detection methods, resulting in over 2,000 evaluations. Through these evaluations, we provide an extensive analysis from various perspectives, leading to 7 new insightful findings contributing to the field. We also open up 4 valuable yet previously underexplored research questions to inspire future works. We release our dataset, code, and pre-trained weights at https://github.com/YZY-stack/DF40.

NeurIPS Conference 2024 Conference Paper

DiffusionFake: Enhancing Generalization in Deepfake Detection via Guided Stable Diffusion

  • Ke Sun
  • Shen Chen
  • Taiping Yao
  • Hong Liu
  • Xiaoshuai Sun
  • Shouhong Ding
  • Rongrong Ji

The rapid progress of Deepfake technology has made face swapping highly realistic, raising concerns about the malicious use of fabricated facial content. Existing methods often struggle to generalize to unseen domains due to the diverse nature of facial manipulations. In this paper, we revisit the generation process and identify a universal principle: Deepfake images inherently contain information from both source and target identities, while genuine faces maintain a consistent identity. Building upon this insight, we introduce DiffusionFake, a novel plug-and-play framework that reverses the generative process of face forgeries to enhance the generalization of detection models. DiffusionFake achieves this by injecting the features extracted by the detection model into a frozen pre-trained Stable Diffusion model, compelling it to reconstruct the corresponding target and source images. This guided reconstruction process constrains the detection network to capture the source- and target-related features that facilitate the reconstruction, thereby learning rich and disentangled representations that are more resilient to unseen forgeries. Extensive experiments demonstrate that DiffusionFake significantly improves cross-domain generalization of various detector architectures without introducing additional parameters during inference. The code is available at https://github.com/skJack/DiffusionFake.git.

AAAI Conference 2024 Conference Paper

Domain-Hallucinated Updating for Multi-Domain Face Anti-spoofing

  • Chengyang Hu
  • Ke-Yue Zhang
  • Taiping Yao
  • Shice Liu
  • Shouhong Ding
  • Xin Tan
  • Lizhuang Ma

Multi-Domain Face Anti-Spoofing (MD-FAS) is a practical setting that aims to update models on new domains using only novel data while ensuring that the knowledge acquired from previous domains is not forgotten. Prior methods utilize the responses from models to represent the previous domain knowledge or map the different domains into separated feature spaces to prevent forgetting. However, due to domain gaps, the responses of new data are not as accurate as those of previous data. Also, without the supervision of previous data, separated feature spaces might be destroyed by new domains while updating, leading to catastrophic forgetting. Inspired by the challenges posed by the lack of previous data, we solve this issue from a new standpoint: generating hallucinated previous data for updating the FAS model. To this end, we propose a novel Domain-Hallucinated Updating (DHU) framework to facilitate the hallucination of data. Specifically, a Domain Information Explorer learns representative domain information of the previous domains. Then, a Domain Information Hallucination module transfers the new domain data into pseudo-previous-domain ones. Moreover, a Hallucinated Features Joint Learning module is proposed to asymmetrically align the new and pseudo-previous data for real samples via dual levels to learn more generalized features, improving the results on all domains. Our experimental results and visualizations demonstrate that the proposed method outperforms state-of-the-art competitors in terms of effectiveness.

AAAI Conference 2024 Conference Paper

HDMixer: Hierarchical Dependency with Extendable Patch for Multivariate Time Series Forecasting

  • Qihe Huang
  • Lei Shen
  • Ruixin Zhang
  • Jiahuan Cheng
  • Shouhong Ding
  • Zhengyang Zhou
  • Yang Wang

Multivariate time series (MTS) prediction has been widely adopted in various scenarios. Recently, some methods have employed patching to enhance local semantics and improve model performance. However, fixed-length patches are prone to losing temporal boundary information, such as complete peaks and periods. Moreover, existing methods mainly focus on modeling long-term dependencies across patches, while paying little attention to other dimensions (e.g., short-term dependencies within patches and complex interactions among cross-variable patches). To address these challenges, we propose a pure MLP-based HDMixer, aiming to acquire patches with richer semantic information and to efficiently model hierarchical interactions. Specifically, we design a Length-Extendable Patcher (LEP) tailored to MTS, which enriches the boundary information of patches and alleviates semantic incoherence in series. Subsequently, we devise a Hierarchical Dependency Explorer (HDE) based on pure MLPs. This explorer effectively models short-term dependencies within patches, long-term dependencies across patches, and complex interactions among variables. Extensive experiments on 9 real-world datasets demonstrate the superiority of our approach. The code is available at https://github.com/hqh0728/HDMixer.

AAAI Conference 2024 Conference Paper

MmAP: Multi-Modal Alignment Prompt for Cross-Domain Multi-Task Learning

  • Yi Xin
  • Junlong Du
  • Qiang Wang
  • Ke Yan
  • Shouhong Ding

Multi-Task Learning (MTL) is designed to train multiple correlated tasks simultaneously, thereby enhancing the performance of individual tasks. Typically, a multi-task network structure consists of a shared backbone and task-specific decoders. However, the complexity of the decoders increases with the number of tasks. To tackle this challenge, we integrate the decoder-free vision-language model CLIP, which exhibits robust zero-shot generalization capability. Recently, parameter-efficient transfer learning methods have been extensively explored with CLIP for adapting to downstream tasks, where prompt tuning showcases strong potential. Nevertheless, these methods solely fine-tune a single modality (text or visual), disrupting the modality structure of CLIP. In this paper, we first propose Multi-modal Alignment Prompt (MmAP) for CLIP, which aligns text and visual modalities during the fine-tuning process. Building upon MmAP, we develop an innovative multi-task prompt learning framework. On the one hand, to maximize the complementarity of tasks with high similarity, we utilize a gradient-driven task grouping method that partitions tasks into several disjoint groups and assigns a group-shared MmAP to each group. On the other hand, to preserve the unique characteristics of each task, we assign a task-specific MmAP to each task. Comprehensive experiments on two large multi-task learning datasets demonstrate that our method achieves significant performance improvements compared to full fine-tuning while only utilizing approximately 0.09% of trainable parameters.
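The gradient-driven task grouping can be approximated by clustering tasks on gradient cosine similarity. The greedy rule and the threshold below are illustrative simplifications, not the paper's exact procedure.

```python
import numpy as np

def group_tasks(task_grads, threshold=0.3):
    """Greedy grouping: tasks whose flattened gradients point in similar
    directions share one group (and would share a group-shared prompt).

    task_grads: (T, P) array, one flattened gradient vector per task.
    """
    g = task_grads / np.linalg.norm(task_grads, axis=1, keepdims=True)
    sim = g @ g.T                                  # pairwise cosine similarity
    groups, unassigned = [], list(range(len(g)))
    while unassigned:
        seed = unassigned.pop(0)                   # start a group from one task
        members = [seed] + [t for t in unassigned if sim[seed, t] > threshold]
        unassigned = [t for t in unassigned if t not in members]
        groups.append(members)
    return groups
```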

ICML Conference 2024 Conference Paper

Model Tailor: Mitigating Catastrophic Forgetting in Multi-modal Large Language Models

  • Didi Zhu
  • Zhongyi Sun 0002
  • Zexi Li 0001
  • Tao Shen 0002
  • Ke Yan
  • Shouhong Ding
  • Chao Wu 0001
  • Kun Kuang 0001

Catastrophic forgetting emerges as a critical challenge when fine-tuning multi-modal large language models (MLLMs), where improving performance on unseen tasks often leads to a significant performance drop on the original tasks. This paper presents a comprehensive analysis of catastrophic forgetting in MLLMs and introduces a post-training adjustment method called Model Tailor. Our method primarily preserves the pre-trained parameters while replacing a small number ($\leq$ 10%) of fine-tuned parameters, maintaining $\sim$ 99% effectiveness on original tasks versus pre-training, and achieving $\sim$ 97% on new tasks compared to standard fine-tuning. Specifically, we derive a sparse mask to identify the model patch, based on a fusion strategy that integrates salience and sensitivity analysis. Subsequently, a compensation mechanism is introduced to decorate the patch, enhancing the model's performance on both target and original tasks. Additionally, our method is adaptable to multi-task scenarios. Through extensive experiments on InstructBLIP and LLaVA-1.5 in both image captioning and visual question answering tasks, our approach demonstrates significant task adaptability while preserving inherent pre-trained capabilities.
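A sketch of the patch selection, fusing a salience term (weight change) and a sensitivity term (gradient magnitude): the linear fusion rule and the max-normalization below are stand-ins for the paper's exact strategy.

```python
import numpy as np

def select_model_patch(pretrained, finetuned, grads, ratio=0.1, lam=0.5):
    """Keep at most `ratio` of the fine-tuned parameters, chosen by a fused
    salience/sensitivity score; revert the rest to the pre-trained values.

    pretrained, finetuned, grads: flat arrays of equal length (assumption:
    gradients are taken on the original tasks).
    """
    delta = np.abs(finetuned - pretrained)                 # salience: weight shift
    score = (lam * delta / (delta.max() + 1e-12)
             + (1 - lam) * np.abs(grads) / (np.abs(grads).max() + 1e-12))
    k = int(ratio * score.size)
    mask = np.zeros(score.size, dtype=bool)
    mask[np.argsort(-score)[:k]] = True                    # top-k fused scores
    return np.where(mask, finetuned, pretrained), mask     # stitched parameters
```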

AAAI Conference 2024 Conference Paper

PCE-Palm: Palm Crease Energy Based Two-Stage Realistic Pseudo-Palmprint Generation

  • Jianlong Jin
  • Lei Shen
  • Ruixin Zhang
  • Chenglong Zhao
  • Ge Jin
  • Jingyun Zhang
  • Shouhong Ding
  • Yang Zhao

The lack of large-scale data seriously hinders the development of palmprint recognition. Recent approaches address this issue by generating large-scale realistic pseudo palmprints from Bézier curves. However, the significant difference between Bézier curves and real palmprints limits their effectiveness. In this paper, we divide the Bézier-Real difference into creases and texture differences, thus reducing the generation difficulty. We introduce a new palm crease energy (PCE) domain as a bridge from Bézier curves to real palmprints and propose a two-stage generation model. The first stage generates PCE images (realistic creases) from Bézier curves, and the second stage outputs realistic palmprints (realistic texture) with PCE images as input. In addition, we also design a lightweight plug-and-play line feature enhancement block to facilitate domain transfer and improve recognition performance. Extensive experimental results demonstrate that the proposed method surpasses state-of-the-art methods. Under extremely few data settings like 40 IDs (only 2.5% of the total training set), our model achieves a 29% improvement over RPG-Palm and outperforms ArcFace with 100% training set by more than 6% in terms of TAR@FAR=1e-6.

NeurIPS Conference 2024 Conference Paper

SAFE: Slow and Fast Parameter-Efficient Tuning for Continual Learning with Pre-Trained Models

  • Linglan Zhao
  • Xuerui Zhang
  • Ke Yan
  • Shouhong Ding
  • Weiran Huang

Continual learning aims to incrementally acquire new concepts in data streams while resisting forgetting previous knowledge. With the rise of powerful pre-trained models (PTMs), there is a growing interest in training incremental learning systems using these foundation models, rather than learning from scratch. Existing works often view PTMs as a strong initial point and directly apply parameter-efficient tuning (PET) in the first session for adapting to downstream tasks. In the following sessions, most methods freeze model parameters for tackling forgetting issues. However, applying PET directly to downstream data cannot fully explore the inherent knowledge in PTMs. Additionally, freezing the parameters in incremental sessions hinders models' plasticity to novel concepts not covered in the first session. To solve the above issues, we propose a Slow And Fast parameter-Efficient tuning (SAFE) framework. In particular, to inherit general knowledge from foundation models, we include a transfer loss function by measuring the correlation between the PTM and the PET-applied model. After calibrating in the first session, the slow efficient tuning parameters can capture more informative features, improving generalization to incoming classes. Moreover, to further incorporate novel concepts, we strike a balance between stability and plasticity by fixing slow efficient tuning parameters and continuously updating the fast ones. Specifically, a cross-classification loss with feature alignment is proposed to circumvent catastrophic forgetting. During inference, we introduce an entropy-based aggregation strategy to dynamically utilize the complementarity in the slow and fast learners. Extensive experiments on seven benchmark datasets verify the effectiveness of our method by significantly surpassing the state-of-the-art.
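The entropy-based aggregation at inference might look like the following, where each learner's weight grows with its prediction confidence; using one minus normalized entropy as the confidence is an illustrative choice, not the paper's verified rule.

```python
import numpy as np

def entropy_aggregate(slow_logits, fast_logits):
    """Fuse slow/fast learner logits, trusting the more certain learner.

    slow_logits, fast_logits: (C,) class logits from the two learners.
    """
    def confidence(z):
        p = np.exp(z - z.max())
        p /= p.sum()
        ent = -(p * np.log(p + 1e-12)).sum() / np.log(len(p))
        return 1.0 - ent                    # 1 = fully certain, 0 = uniform
    w_slow = confidence(slow_logits)
    w_fast = confidence(fast_logits)
    total = w_slow + w_fast + 1e-12
    return (w_slow * np.asarray(slow_logits)
            + w_fast * np.asarray(fast_logits)) / total
```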

AAAI Conference 2023 Conference Paper

Attack Can Benefit: An Adversarial Approach to Recognizing Facial Expressions under Noisy Annotations

  • Jiawen Zheng
  • Bo Li
  • Shengchuan Zhang
  • Shuang Wu
  • Liujuan Cao
  • Shouhong Ding

Real-world Facial Expression Recognition (FER) datasets usually exhibit complex scenarios with coupled noisy annotations and imbalanced class distributions, which undoubtedly impede the development of FER methods. To address the aforementioned issues, in this paper we propose a novel and flexible method to spot noisy labels by leveraging adversarial attack, termed Geometry Aware Adversarial Vulnerability Estimation (GAAVE). Different from existing state-of-the-art methods of noisy label learning (NLL), our method has no reliance on additional information and is thus easy to generalize to large-scale real-world FER datasets. Besides, the combination of the Dataset Splitting module and the Subset Refactoring module mitigates the impact of class imbalance, and the Self-Annotator module facilitates the sufficient use of all training data. Extensive experiments on the RAF-DB, FERPlus, AffectNet, and CIFAR-10 datasets validate the effectiveness of our method. The consistent improvement it brings to different base methods demonstrates the flexibility of the proposed GAAVE.

NeurIPS Conference 2023 Conference Paper

Content-based Unrestricted Adversarial Attack

  • Zhaoyu Chen
  • Bo Li
  • Shuang Wu
  • Kaixun Jiang
  • Shouhong Ding
  • Wenqiang Zhang

Unrestricted adversarial attacks typically manipulate the semantic content of an image (e.g., color or texture) to create adversarial examples that are both effective and photorealistic, demonstrating their ability to deceive human perception and deep neural networks with stealth and success. However, current works usually sacrifice unrestricted degrees of freedom and subjectively select some image content to guarantee the photorealism of unrestricted adversarial examples, which limits their attack performance. To ensure the photorealism of adversarial examples and boost attack performance, we propose a novel unrestricted attack framework called Content-based Unrestricted Adversarial Attack. By leveraging a low-dimensional manifold that represents natural images, we map the images onto the manifold and optimize them along its adversarial direction. Therefore, within this framework, we implement Adversarial Content Attack (ACA) based on Stable Diffusion and can generate highly transferable unrestricted adversarial examples with various adversarial contents. Extensive experimentation and visualization demonstrate the efficacy of ACA, particularly in surpassing state-of-the-art attacks by an average of 13.3%-50.4% and 16.8%-48.0% on normally trained models and defense methods, respectively.

NeurIPS Conference 2023 Conference Paper

CrossGNN: Confronting Noisy Multivariate Time Series Via Cross Interaction Refinement

  • Qihe Huang
  • Lei Shen
  • Ruixin Zhang
  • Shouhong Ding
  • Binwu Wang
  • Zhengyang Zhou
  • Yang Wang

Recently, multivariate time series (MTS) forecasting techniques have seen rapid development and widespread application across various fields. Transformer-based and GNN-based methods have shown promising potential due to their strong ability to model the interactions of time and variables. However, by conducting a comprehensive analysis of real-world data, we observe that the temporal fluctuations and the heterogeneity between variables are not well handled by existing methods. To address the above issues, we propose CrossGNN, a linear-complexity GNN model that refines cross-scale and cross-variable interactions for MTS. To deal with unexpected noise in the time dimension, an adaptive multi-scale identifier (AMSI) is leveraged to construct multi-scale time series with reduced noise. A Cross-Scale GNN is proposed to extract the scales with clearer trends and weaker noise. A Cross-Variable GNN is proposed to utilize the homogeneity and heterogeneity between different variables. By simultaneously focusing on edges with higher saliency scores and constraining edges with lower scores, the time and space complexity (i.e., $O(L)$) of CrossGNN is linear in the input sequence length $L$. Extensive experimental results on 8 real-world MTS datasets demonstrate the effectiveness of CrossGNN compared with state-of-the-art methods.

AAAI Conference 2023 Conference Paper

Delving into the Adversarial Robustness of Federated Learning

  • Jie Zhang
  • Bo Li
  • Chen Chen
  • Lingjuan Lyu
  • Shuang Wu
  • Shouhong Ding
  • Chao Wu

In Federated Learning (FL), models are as fragile as centrally trained models against adversarial examples. However, the adversarial robustness of federated learning remains largely unexplored. This paper casts light on the challenge of adversarial robustness of federated learning. To facilitate a better understanding of the adversarial vulnerability of the existing FL methods, we conduct comprehensive robustness evaluations on various attacks and adversarial training methods. Moreover, we reveal the negative impacts induced by directly adopting adversarial training in FL, which seriously hurts the test accuracy, especially in non-IID settings. In this work, we propose a novel algorithm called Decision Boundary based Federated Adversarial Training (DBFAT), which consists of two components (local re-weighting and global regularization) to improve both accuracy and robustness of FL systems. Extensive experiments on multiple datasets demonstrate that DBFAT consistently outperforms other baselines under both IID and non-IID settings.

NeurIPS Conference 2022 Conference Paper

Adv-Attribute: Inconspicuous and Transferable Adversarial Attack on Face Recognition

  • Shuai Jia
  • Bangjie Yin
  • Taiping Yao
  • Shouhong Ding
  • Chunhua Shen
  • Xiaokang Yang
  • Chao Ma

Deep learning models have shown their vulnerability when dealing with adversarial attacks. Existing attacks almost always operate on low-level instances, such as pixels and super-pixels, and rarely exploit semantic clues. For face recognition attacks, existing methods typically generate $l_p$-norm perturbations on pixels, resulting in low attack transferability and high vulnerability to denoising defense models. In this work, instead of perturbing the low-level pixels, we propose to generate attacks by perturbing high-level semantics to improve attack transferability. Specifically, a unified flexible framework, Adversarial Attributes (Adv-Attribute), is designed to generate inconspicuous and transferable attacks on face recognition, which crafts the adversarial noise and adds it to different attributes based on the guidance of the difference in face recognition features from the target. Moreover, importance-aware attribute selection and a multi-objective optimization strategy are introduced to further ensure the balance of stealthiness and attacking strength. Extensive experiments on the FFHQ and CelebA-HQ datasets show that the proposed Adv-Attribute method achieves state-of-the-art attack success rates while maintaining better visual effects than recent attack methods.

AAAI Conference 2022 Conference Paper

Delving into the Local: Dynamic Inconsistency Learning for DeepFake Video Detection

  • Zhihao Gu
  • Yang Chen
  • Taiping Yao
  • Shouhong Ding
  • Jilin Li
  • Lizhuang Ma

The rapid development of facial manipulation techniques has aroused public concern in recent years. Existing deepfake video detection approaches attempt to capture the discriminative features between real and fake faces based on temporal modelling. However, these works impose supervision on sparsely sampled video frames and overlook the local motions among adjacent frames, which encode rich inconsistency information that can serve as an efficient indicator for DeepFake video detection. To mitigate this issue, we delve into local motion and propose a novel sampling unit named the snippet, which contains a few successive video frames, for local temporal inconsistency learning. Moreover, we elaborately design an Intra-Snippet Inconsistency Module (Intra-SIM) and an Inter-Snippet Interaction Module (Inter-SIM) to establish a dynamic inconsistency modelling framework. Specifically, the Intra-SIM applies bi-directional temporal difference operations and a learnable convolution kernel to mine the short-term motions within each snippet. The Inter-SIM is then devised to promote cross-snippet information interaction to form global representations. The Intra-SIM and Inter-SIM work in an alternating manner and can be plugged into existing 2D CNNs. Our method outperforms the state-of-the-art competitors on four popular benchmark datasets, i.e., FaceForensics++, Celeb-DF, DFDC, and WildDeepfake. Besides, extensive experiments and visualizations are presented to further illustrate its effectiveness.

NeurIPS Conference 2022 Conference Paper

DENSE: Data-Free One-Shot Federated Learning

  • Jie Zhang
  • Chen Chen
  • Bo Li
  • Lingjuan Lyu
  • Shuang Wu
  • Shouhong Ding
  • Chunhua Shen
  • Chao Wu

One-shot Federated Learning (FL) has recently emerged as a promising approach, which allows the central server to learn a model in a single communication round. Despite the low communication cost, existing one-shot FL methods are mostly impractical or face inherent limitations, e.g., a public dataset is required, clients' models are homogeneous, and additional data/model information needs to be uploaded. To overcome these issues, we propose a novel two-stage Data-freE oNe-Shot federated lEarning (DENSE) framework, which trains the global model via a data generation stage and a model distillation stage. DENSE is a practical one-shot FL method that can be applied in reality due to the following advantages: (1) DENSE requires no additional information compared with other methods (except the model parameters) to be transferred between clients and the server; (2) DENSE does not require any auxiliary dataset for training; (3) DENSE considers model heterogeneity in FL, i.e., different clients can have different model architectures. Experiments on a variety of real-world datasets demonstrate the superiority of our method. For example, DENSE outperforms the best baseline method Fed-ADI by 5.08% on the CIFAR10 dataset.

AAAI Conference 2022 Conference Paper

Dual Contrastive Learning for General Face Forgery Detection

  • Ke Sun
  • Taiping Yao
  • Shen Chen
  • Shouhong Ding
  • Jilin Li
  • Rongrong Ji

With various facial manipulation techniques arising, face forgery detection has drawn growing attention due to security concerns. Previous works always formulate face forgery detection as a classification problem based on cross-entropy loss, which emphasizes category-level differences rather than the essential discrepancies between real and fake faces, limiting model generalization in unseen domains. To address this issue, we propose a novel face forgery detection framework, named Dual Contrastive Learning (DCL), which specially constructs positive and negative paired data and performs designed contrastive learning at different granularities to learn generalized feature representation. Concretely, combined with the hard sample selection strategy, Inter-Instance Contrastive Learning (Inter-ICL) is first proposed to promote task-related discriminative features learning by especially constructing instance pairs. Moreover, to further explore the essential discrepancies, Intra-Instance Contrastive Learning (Intra-ICL) is introduced to focus on the local content inconsistencies prevalent in forged faces by constructing local-region pairs inside instances. Extensive experiments and visualizations on several datasets demonstrate the generalization of our method against state-of-the-art competitors. Our code is available at https://github.com/Tencent/TFace.git.

AAAI Conference 2022 Conference Paper

Exploiting Fine-Grained Face Forgery Clues via Progressive Enhancement Learning

  • Qiqi Gu
  • Shen Chen
  • Taiping Yao
  • Yang Chen
  • Shouhong Ding
  • Ran Yi

With the rapid development of facial forgery techniques, forgery detection has attracted more and more attention due to security concerns. Existing approaches attempt to use frequency information to mine subtle artifacts in high-quality forged faces. However, their exploitation of frequency information is coarse-grained and, more importantly, their vanilla learning process struggles to extract fine-grained forgery traces. To address this issue, we propose a progressive enhancement learning framework to exploit both RGB and fine-grained frequency clues. Specifically, we perform a fine-grained decomposition of RGB images to completely decouple the real and fake traces in the frequency space. Subsequently, we propose a progressive enhancement learning framework based on a two-branch network, combined with self-enhancement and mutual-enhancement modules. The self-enhancement module captures the traces in different input spaces based on spatial noise enhancement and channel attention. The mutual-enhancement module concurrently enhances RGB and frequency features by communicating in the shared spatial dimension. The progressive enhancement process facilitates the learning of discriminative features with fine-grained face forgery clues. Extensive experiments on several datasets show that our method outperforms the state-of-the-art face forgery detection methods.
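
For intuition on working in the frequency space, here is a rough sketch of splitting an image into low- and high-frequency components with a radial Fourier mask; the cutoff radius and this coarse two-way split are assumptions for illustration and are much simpler than the fine-grained decomposition described above.

```python
# Hypothetical low/high frequency split of an image batch via a radial mask
# in the Fourier domain. The cutoff is arbitrary; the paper's fine-grained,
# learned decomposition is not reproduced here.
import torch

def frequency_split(img: torch.Tensor, cutoff: float = 0.1):
    # img: (batch, channels, H, W)
    _, _, h, w = img.shape
    spec = torch.fft.fftshift(torch.fft.fft2(img), dim=(-2, -1))
    yy, xx = torch.meshgrid(torch.linspace(-0.5, 0.5, h),
                            torch.linspace(-0.5, 0.5, w), indexing="ij")
    low_mask = ((xx ** 2 + yy ** 2).sqrt() <= cutoff).to(spec.dtype)
    low = torch.fft.ifft2(torch.fft.ifftshift(spec * low_mask, dim=(-2, -1))).real
    return low, img - low                       # low- and high-frequency parts

low, high = frequency_split(torch.rand(1, 3, 64, 64))
print(low.shape, high.shape)
```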

AAAI Conference 2022 Conference Paper

Feature Generation and Hypothesis Verification for Reliable Face Anti-spoofing

  • Shice Liu
  • Shitao Lu
  • Hongyi Xu
  • Jing Yang
  • Shouhong Ding
  • Lizhuang Ma

Although existing face anti-spoofing (FAS) methods achieve high accuracy in intra-domain experiments, their performance drops severely in cross-domain scenarios because of poor generalization. Recently, multifarious techniques have been explored, such as domain generalization and representation disentanglement. However, the improvement is still limited by two issues: 1) It is difficult to perfectly map all faces to a shared feature space; if faces from unknown domains are not mapped to the known region of the shared feature space, inaccurate predictions will be obtained. 2) It is hard to completely account for the various spoof traces during disentanglement. In this paper, we propose a Feature Generation and Hypothesis Verification framework to alleviate these two issues. Above all, feature generation networks which generate hypotheses of real faces and known attacks are introduced for the first time in the FAS task. Subsequently, two hypothesis verification modules are applied to judge whether the input face comes from the real-face space and the real-face distribution, respectively. Furthermore, analyses of the relationship between our framework and Bayesian uncertainty estimation are given, which provide theoretical support for reliable defense in unknown domains. Experimental results show our framework achieves promising results and outperforms the state-of-the-art approaches on multiple public datasets.

ICML Conference 2022 Conference Paper

Federated Learning with Label Distribution Skew via Logits Calibration

  • Jie Zhang 0081
  • Zhiqi Li 0004
  • Bo Li 0115
  • Jianghe Xu
  • Shuang Wu 0001
  • Shouhong Ding
  • Chao Wu 0001

Traditional federated optimization methods perform poorly with heterogeneous data (i.e., they suffer accuracy reduction), especially with highly skewed data. In this paper, we investigate label distribution skew in FL, where the distribution of labels varies across clients. First, we investigate label distribution skew from a statistical view. We demonstrate both theoretically and empirically that previous methods based on softmax cross-entropy are not suitable, as they can cause local models to heavily overfit to minority classes and missing classes. Additionally, we theoretically introduce a deviation bound to measure the deviation of the gradient after the local update. Finally, we propose FedLC (Federated learning via Logits Calibration), which calibrates the logits before softmax cross-entropy according to the probability of occurrence of each class. FedLC applies a fine-grained calibrated cross-entropy loss to the local update by adding a pairwise label margin. Extensive experiments on federated datasets and real-world datasets demonstrate that FedLC leads to a more accurate global model and much improved performance. Furthermore, integrating other FL methods into our approach can further enhance the performance of the global model.
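
One plausible instantiation of calibrating logits by per-class occurrence before the cross-entropy is sketched below; the specific margin form (tau times count to the power -1/4) is an assumption for illustration and may differ from the paper's exact pairwise margin.

```python
# Hedged sketch of logit calibration before softmax cross-entropy: each class
# logit is offset by a margin derived from how often that class occurs on the
# client, so rarer classes effectively get a larger decision margin. The margin
# form used here is an assumption, not necessarily FedLC's exact formula.
import torch
import torch.nn.functional as F

def calibrated_cross_entropy(logits: torch.Tensor, targets: torch.Tensor,
                             class_counts: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    margins = tau * class_counts.float().clamp(min=1).pow(-0.25)   # rarer class -> larger offset
    return F.cross_entropy(logits - margins.unsqueeze(0), targets)

logits = torch.randn(8, 10)
targets = torch.randint(0, 10, (8,))
counts = torch.tensor([500, 5, 50, 0, 120, 3, 40, 200, 1, 10])     # local label distribution
print(calibrated_cross_entropy(logits, targets, counts))
```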

IJCAI Conference 2022 Conference Paper

Region-Aware Temporal Inconsistency Learning for DeepFake Video Detection

  • Zhihao Gu
  • Taiping Yao
  • Yang Chen
  • Ran Yi
  • Shouhong Ding
  • Lizhuang Ma

The rapid development of face forgery techniques has drawn growing attention due to security concerns. Existing deepfake video detection methods always attempt to capture the discriminative features by directly exploiting static temporal convolution to mine temporal inconsistency, without explicitly exploring the diverse temporal dynamics of different forged regions. To effectively and comprehensively capture these various inconsistencies, in this paper we propose a novel Region-Aware Temporal Filter (RATF) module which automatically generates corresponding temporal filters for different spatial regions. Specifically, we decouple the dynamic temporal kernel into a set of region-agnostic basic filters and region-sensitive aggregation weights. Different weights then guide the corresponding regions to adaptively learn temporal inconsistency, which greatly enhances the overall representational ability. Moreover, to cover long-term temporal dynamics, we divide the video into multiple snippets and propose a Cross-Snippet Attention (CSA) mechanism to promote cross-snippet information interaction. Extensive experiments and visualizations on several benchmarks demonstrate the effectiveness of our method against state-of-the-art competitors.
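
A minimal sketch of generating region-specific temporal filters as weighted combinations of a shared bank of basic filters follows; the shapes, the pooling used to predict aggregation weights, and the module name are assumptions for illustration rather than the paper's implementation.

```python
# Hypothetical region-aware temporal filtering: a small bank of shared
# (region-agnostic) 1D filters is mixed by region-sensitive weights predicted
# from each region's features, and the mixed kernel is applied along time.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionAwareTemporalFilter(nn.Module):
    def __init__(self, channels: int, n_basic: int = 4, kernel_size: int = 3):
        super().__init__()
        self.basic_filters = nn.Parameter(torch.randn(n_basic, kernel_size))
        self.weight_head = nn.Linear(channels, n_basic)   # aggregation weights per region

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, regions, time, channels)
        b, r, t, c = x.shape
        w = F.softmax(self.weight_head(x.mean(dim=2)), dim=-1)        # (b, r, n_basic)
        kernels = torch.einsum("brn,nk->brk", w, self.basic_filters)  # per-region kernels
        k = kernels.shape[-1]
        x_ = x.permute(0, 1, 3, 2).reshape(1, b * r * c, t)           # grouped-conv layout
        k_ = kernels.unsqueeze(2).expand(b, r, c, k).reshape(b * r * c, 1, k)
        out = F.conv1d(x_, k_, padding=k // 2, groups=b * r * c)
        return out.reshape(b, r, c, t).permute(0, 1, 3, 2)

clip = torch.randn(2, 8, 16, 32)   # 2 videos, 8 regions, 16 frames, 32-dim features
print(RegionAwareTemporalFilter(32)(clip).shape)   # torch.Size([2, 8, 16, 32])
```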

IJCAI Conference 2021 Conference Paper

Adv-Makeup: A New Imperceptible and Transferable Attack on Face Recognition

  • Bangjie Yin
  • Wenxuan Wang
  • Taiping Yao
  • Junfeng Guo
  • Zelun Kong
  • Shouhong Ding
  • Jilin Li
  • Cong Liu

Deep neural networks, particularly face recognition models, have been shown to be vulnerable to both digital and physical adversarial examples. However, existing adversarial examples against face recognition systems either lack transferability to black-box models or fail to be implemented in practice. In this paper, we propose a unified adversarial face generation method, Adv-Makeup, which can realize imperceptible and transferable attacks under the black-box setting. Adv-Makeup develops a task-driven makeup generation method with a blending module to synthesize imperceptible eye shadow over the orbital region of faces. To achieve transferability, Adv-Makeup implements a fine-grained meta-learning based adversarial attack strategy to learn more vulnerable or sensitive features from various models. Compared to existing techniques, extensive visualization results demonstrate that Adv-Makeup is capable of generating much more imperceptible attacks under both digital and physical scenarios. Meanwhile, extensive quantitative experiments show that Adv-Makeup can significantly improve the attack success rate under the black-box setting, even when attacking commercial systems.

IJCAI Conference 2021 Conference Paper

Dual Reweighting Domain Generalization for Face Presentation Attack Detection

  • Shubao Liu
  • Ke-Yue Zhang
  • Taiping Yao
  • Kekai Sheng
  • Shouhong Ding
  • Ying Tai
  • Jilin Li
  • Yuan Xie

Face anti-spoofing approaches based on domain generalization (DG) have drawn growing attention due to their robustness in unseen scenarios. Previous methods treat each sample from multiple domains indiscriminately during the training process and endeavor to extract a common feature space to improve generalization. However, due to complex and biased data distributions, treating all samples equally will corrupt the generalization ability. To address this issue, we propose a novel Dual Reweighting Domain Generalization (DRDG) framework which iteratively reweights the relative importance between samples to further improve generalization. Concretely, a Sample Reweighting Module is first proposed to identify samples with relatively large domain bias and reduce their impact on the overall optimization. Afterwards, a Feature Reweighting Module is introduced to focus on these samples and extract more domain-irrelevant features via a self-distilling mechanism. Combined with the domain discriminator, the iteration of the two modules promotes the extraction of generalized features. Extensive experiments and visualizations are presented to demonstrate the effectiveness and interpretability of our method against the state-of-the-art competitors.

AAAI Conference 2021 Conference Paper

Generalizable Representation Learning for Mixture Domain Face Anti-Spoofing

  • Zhihong Chen
  • Taiping Yao
  • Kekai Sheng
  • Shouhong Ding
  • Ying Tai
  • Jilin Li
  • Feiyue Huang
  • Xinyu Jin

The face anti-spoofing approach based on domain generalization (DG) has drawn growing attention due to its robustness in unseen scenarios. Existing DG methods assume that the domain label is known. However, in real-world applications the collected dataset always contains mixture domains, where the domain label is unknown. In this case, most existing methods may not work. Further, even if we could obtain the domain label as existing methods do, we argue that this is only a sub-optimal partition. To overcome this limitation, we propose domain dynamic adjustment meta-learning (D²AM), which requires no domain labels: it iteratively divides mixture domains via discriminative domain representations and trains a generalizable face anti-spoofing model with meta-learning. Specifically, we design a domain feature based on Instance Normalization (IN) and propose a domain representation learning module (DRLM) to extract discriminative domain features for clustering. Moreover, to reduce the side effect of outliers on clustering performance, we additionally utilize maximum mean discrepancy (MMD) to align the distribution of sample features to a prior distribution, which improves the reliability of clustering. Extensive experiments show that the proposed method outperforms conventional DG-based face anti-spoofing methods, including those utilizing domain labels. Furthermore, we enhance interpretability through visualization.

AAAI Conference 2021 Conference Paper

Local Relation Learning for Face Forgery Detection

  • Shen Chen
  • Taiping Yao
  • Yang Chen
  • Shouhong Ding
  • Jilin Li
  • Rongrong Ji

With the rapid development of facial manipulation techniques, face forgery detection has received considerable attention in digital media forensics due to security concerns. Most existing methods formulate face forgery detection as a classification problem and utilize binary labels or manipulated region masks as supervision. However, without considering the correlation between local regions, these global supervisions are insufficient for learning a generalized feature and are prone to overfitting. To address this issue, we propose a novel perspective on face forgery detection via local relation learning. Specifically, we propose a Multi-scale Patch Similarity Module (MPSM), which measures the similarity between features of local regions and forms a robust and generalized similarity pattern. Moreover, we propose an RGB-Frequency Attention Module (RFAM) to fuse information in both the RGB and frequency domains for a more comprehensive local feature representation, which further improves the reliability of the similarity pattern. Extensive experiments show that the proposed method consistently outperforms the state-of-the-art methods on widely used benchmarks. Furthermore, detailed visualizations show the robustness and interpretability of our method.
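
To illustrate the similarity-pattern idea, a simplified single-scale sketch follows; the patch size, pooling, and function name are assumptions for illustration (the paper's MPSM is multi-scale and paired with frequency attention, neither of which is reproduced here).

```python
# Hypothetical single-scale patch similarity pattern: pool the backbone
# feature map into patch descriptors and compute their pairwise cosine
# similarities as a relation map.
import torch
import torch.nn.functional as F

def patch_similarity_pattern(feature_map: torch.Tensor, patch: int = 2) -> torch.Tensor:
    # feature_map: (batch, channels, H, W) from a CNN backbone
    pooled = F.avg_pool2d(feature_map, kernel_size=patch)            # one descriptor per patch
    b, c, h, w = pooled.shape
    patches = F.normalize(pooled.flatten(2).transpose(1, 2), dim=2)  # (b, h*w, c)
    return patches @ patches.transpose(1, 2)                         # (b, h*w, h*w)

fmap = torch.randn(1, 64, 8, 8)
print(patch_similarity_pattern(fmap).shape)   # torch.Size([1, 16, 16])
```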