Arrow Research search

Author name cluster

Shouhong Ding

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

42 papers
2 author rows

Possible papers (42)

AAAI Conference 2026 Conference Paper

D²Pruner: Debiased Importance and Structural Diversity for MLLM Token Pruning

  • Evelyn Zhang
  • Fufu Yu
  • Aoqi Wu
  • Zichen Wen
  • Ke Yan
  • Shouhong Ding
  • Biqing Qi
  • Linfeng Zhang

Processing long visual token sequences poses a significant computational burden on Multimodal Large Language Models (MLLMs). While token pruning offers a path to acceleration, we find that current methods, while adequate for general understanding, catastrophically fail on fine-grained localization tasks. We attribute this failure to the inherent flaws of the two prevailing strategies: importance-based methods suffer from a strong positional bias, an inherent model artifact that distracts from semantic content, while diversity-based methods exhibit structural blindness, disregarding the user's prompt and spatial redundancy. To address this, we introduce D²Pruner, a framework that rectifies these issues by uniquely combining debiased importance with a structural pruning mechanism. Our method first secures a core set of the most critical tokens as pivots based on a debiased attention score. It then performs a Maximal Independent Set (MIS) selection on the remaining tokens, which are modeled on a hybrid graph where edges signify spatial proximity and semantic similarity. This process iteratively preserves the most important available token while removing its neighbors, ensuring that the supplementary tokens are chosen to maximize importance and diversity. Extensive experiments demonstrate that D²Pruner achieves exceptional efficiency and fidelity.
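For intuition, the MIS-style selection described in the abstract can be pictured as a greedy pass over a token graph. The sketch below is a hypothetical reconstruction, not the authors' code: the thresholds, the importance scores, and the hybrid-graph construction are illustrative assumptions.

```python
import numpy as np

def greedy_mis_prune(tokens, scores, budget, sim_thresh=0.8,
                     coords=None, dist_thresh=1.5):
    """Greedy MIS-style selection: repeatedly keep the highest-scoring
    surviving token and delete its neighbors on the hybrid graph.

    tokens: (N, D) token features; scores: (N,) importance values
    (e.g., debiased attention); coords: optional (N, 2) grid positions.
    """
    feats = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-12)
    adj = feats @ feats.T > sim_thresh                 # semantic-similarity edges
    if coords is not None:                             # spatial-proximity edges
        dist = np.linalg.norm(coords[:, None] - coords[None], axis=-1)
        adj |= dist < dist_thresh
    np.fill_diagonal(adj, False)

    alive = np.ones(len(tokens), dtype=bool)
    kept = []
    for i in np.argsort(-scores):                      # most important first
        if not alive[i]:
            continue
        kept.append(i)
        alive[i] = False
        alive[adj[i]] = False                          # drop its graph neighbors
        if len(kept) == budget:
            break
    return np.array(kept)
```

In the paper's pipeline, the pivot tokens chosen by debiased importance would presumably be fixed first, with a pass like this one filling only the remaining budget.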

AAAI Conference 2026 Conference Paper

GloTok: Global Perspective Tokenizer for Image Reconstruction and Generation

  • Xuan Zhao
  • Zhongyu Zhang
  • Yuge Huang
  • Yuxi Mi
  • Guodong Mu
  • Shouhong Ding
  • Jun Wang
  • Rizen Guo

Existing state-of-the-art image tokenization methods leverage diverse semantic features from pre-trained vision models for additional supervision, to expand the distribution of latent representations and thereby improve the quality of image reconstruction and generation. These methods employ a locally supervised approach for semantic supervision, which limits the uniformity of semantic distribution. However, VA-VAE proves that a more uniform feature distribution yields better generation performance. In this work, we introduce a Global Perspective Tokenizer (GloTok), which utilizes global relational information to model a more uniform semantic distribution of tokenized features. Specifically, a codebook-wise histogram relation learning method is proposed to transfer the semantics, which are modeled by pre-trained models on the entire dataset, to the semantic codebook. Then, we design a residual learning module which recovers the fine-grained details to minimize the reconstruction error caused by quantization. Through the above design, GloTok delivers more uniformly distributed semantic latent representations, which facilitates the training of autoregressive (AR) models for generating high-quality images without requiring direct access to pre-trained models during the training process. Experiments on the standard ImageNet-1k benchmark clearly show that our proposed method achieves state-of-the-art reconstruction performance and generation quality.

AAAI Conference 2026 Conference Paper

LSAP-PV: High-Fidelity Palm Vein Image Synthesis via Layered Spectral Absorption Projection-Guided Diffusion Model

  • Sheng Shang
  • Chenglong Zhao
  • Ruixin Zhang
  • Jianlong Jin
  • Jingyun Zhang
  • Jun Wang
  • Yang Zhao
  • Shouhong Ding

Palm vein recognition has emerged as a promising biometric technology, yet its development remains constrained by the scarcity of large-scale publicly available datasets. Several methods of palm vein image generation have been proposed to address this issue. These methods usually focus on the anatomical realism of palm vein patterns, but overlook the biophysical correlation between identities and vein patterns, particularly in simulating identity-specific vein contrast. To tackle this limitation, we propose a novel biophysics-driven synthesis method. Our method constructs a 3D palm vascular tree via an established modeling method. Then, a projection model is proposed to map the 3D tree into 2D space to derive palm vein patterns. The projection model is based on skin spectral absorption and simulates the natural attenuation of light passing through the skin using a layer integration method. For different identities, we sample different skin parameters, resulting in varying degrees of attenuation. This method effectively simulates the variation in vein contrast across different identities. Furthermore, we introduce a conditional diffusion model that uses the projected patterns as identity conditions to generate palm vein images. To the best of our knowledge, this is the first palm vein generation method based on the diffusion model. Experimental results demonstrate that our method not only outperforms existing methods, but also enables a recognition model trained on our synthetic data to achieve superior performance compared to a model trained on real-world data at a scale of 2,000 IDs, in terms of TAR@FAR=1e-4 under the 1:1 open-set protocol.
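The layered spectral-absorption projection is essentially a Beer-Lambert-style attenuation integrated over skin layers. A minimal sketch, with made-up coefficients (mu_layers, thickness, and mu_blood are placeholders, not values from the paper):

```python
import numpy as np

def project_vein_pattern(path_len, mu_layers, thickness, mu_blood=1.8):
    """Toy layered-absorption projection of a vessel onto the skin surface.

    path_len:  (H, W) length of vessel traversed by each ray (mm, 0 = no vein).
    mu_layers: absorption coefficients of the skin layers above the vein (1/mm).
    thickness: matching layer thicknesses (mm). All coefficients are placeholders.
    """
    # Optical depth of the skin stack (layer integration, Beer-Lambert style).
    skin_od = sum(m * t for m, t in zip(mu_layers, thickness))
    # Extra attenuation where the ray also passes through venous blood.
    transmitted = np.exp(-(skin_od + mu_blood * path_len))
    return transmitted / transmitted.max()  # normalized 2D vein contrast pattern
```

Sampling mu_layers and thickness per identity would then vary the vein contrast across identities, as the abstract describes.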

AAAI Conference 2026 Conference Paper

TripleFDS: Triple Feature Disentanglement and Synthesis for Scene Text Editing

  • Yuchen Bao
  • Yiting Wang
  • Wenjian Huang
  • Haowei Wang
  • Shen Chen
  • Taiping Yao
  • Shouhong Ding
  • Jianguo Zhang

Scene Text Editing (STE) aims to naturally modify text in images while preserving visual consistency, the decisive factors of which can be divided into three parts, i.e., text style, text content, and background. Previous methods have struggled with incomplete disentanglement of editable attributes, typically addressing only one aspect—such as editing text content—thus limiting controllability and visual consistency. To overcome these limitations, we propose TripleFDS, a novel framework for STE with disentangled modular attributes, and an accompanying dataset called SCB Synthesis. SCB Synthesis provides robust training data for triple feature disentanglement by utilizing the "SCB Group", a novel construct that combines three attributes per image to generate diverse, disentangled training groups. Leveraging this construct as a basic training unit, TripleFDS first disentangles triple features, ensuring semantic accuracy through inter-group contrastive regularization and preventing redundancy through intra-sample multi-feature orthogonality. In the synthesis phase, TripleFDS performs feature remapping to prevent "shortcut" phenomena during reconstruction and mitigate potential feature leakage. Trained on 125,000 SCB Groups, TripleFDS achieves state-of-the-art image fidelity (SSIM of 44.54) and text accuracy (ACC of 93.58%) on the mainstream STE benchmarks. Besides superior performance, the more flexible editing of TripleFDS supports new operations such as style replacement and background transfer.

NeurIPS Conference 2025 Conference Paper

Dual Data Alignment Makes AI-Generated Image Detector Easier Generalizable

  • Ruoxin Chen
  • Junwei Xi
  • Zhiyuan Yan
  • Ke-Yue Zhang
  • Shuang Wu
  • Jingyi Xie
  • Xu Chen
  • Lei Xu

The rapid increase in AI-generated images (AIGIs) underscores the need for detection methods. Existing detectors are often trained on biased datasets, leading to overfitting on spurious correlations between non-causal image attributes and real/synthetic labels. While these biased features enhance performance on the training data, they result in substantial performance degradation when tested on unbiased datasets. A common solution is to perform data alignment through generative reconstruction, matching the content between real and synthetic images. However, we find that pixel-level alignment alone is inadequate, as the reconstructed images still suffer from frequency-level misalignment, perpetuating spurious correlations. To illustrate, we observe that reconstruction models restore the high-frequency details lost in real images, inadvertently creating a frequency-level misalignment, where synthetic images appear to have richer high-frequency content than real ones. This misalignment leads to models associating high-frequency features with synthetic labels, further reinforcing biased cues. To resolve this, we propose Dual Data Alignment (DDA), which aligns both the pixel and frequency domains. DDA generates synthetic images that closely resemble real ones by fusing real and synthetic image pairs in both domains, enhancing the detector's ability to identify forgeries without relying on biased features. Moreover, we introduce two new test sets: DDA-COCO, containing DDA-aligned synthetic images, and EvalGEN, featuring the latest generative models. Our extensive evaluations demonstrate that a detector trained exclusively on DDA-aligned MSCOCO improves across diverse benchmarks. Code is available at https://github.com/roy-ch/Dual-Data-Alignment.
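A rough sketch of the frequency-domain half of such an alignment is given below. The radial cutoff and the grayscale simplification are assumptions; DDA's actual fusion operates on paired real/synthetic images in both the pixel and frequency domains.

```python
import numpy as np

def fuse_frequencies(real, synthetic, cutoff=0.1):
    """Swap frequency bands between a paired real image and its synthetic
    counterpart: low frequencies from the synthetic image, high frequencies
    from the real one, so high-frequency statistics follow the real data.

    real, synthetic: (H, W) grayscale float arrays of identical shape.
    """
    h, w = real.shape
    f_real = np.fft.fftshift(np.fft.fft2(real))
    f_syn = np.fft.fftshift(np.fft.fft2(synthetic))
    yy, xx = np.ogrid[:h, :w]
    radius = np.hypot(yy - h / 2, xx - w / 2)
    low_pass = radius <= cutoff * min(h, w)        # circular low-frequency mask
    fused = np.where(low_pass, f_syn, f_real)      # band-wise composition
    return np.real(np.fft.ifft2(np.fft.ifftshift(fused)))
```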

IJCAI Conference 2025 Conference Paper

EyeSeg: An Uncertainty-Aware Eye Segmentation Framework for AR/VR

  • Zhengyuan Peng
  • Jianqing Xu
  • Shen Li
  • Jiazhen Ji
  • Yuge Huang
  • Jingyun Zhang
  • Jinmin Li
  • Shouhong Ding

Human-machine interaction through augmented reality (AR) and virtual reality (VR) is increasingly prevalent, requiring accurate and efficient gaze estimation, which hinges on the accuracy of eye segmentation to enable smooth user experiences. We introduce EyeSeg, a novel eye segmentation framework designed to overcome key challenges that existing approaches struggle with: motion blur, eyelid occlusion, and train-test domain gaps. In these situations, existing models struggle to extract robust features, leading to suboptimal performance. Noting that these challenges can be generally quantified by uncertainty, we design EyeSeg as an uncertainty-aware eye segmentation framework for AR/VR wherein we explicitly model the uncertainties by performing Bayesian uncertainty learning of a posterior under the closed set prior. Theoretically, we prove that a statistic of the learned posterior indicates segmentation uncertainty levels; empirically, it outperforms existing methods in downstream tasks such as gaze estimation. EyeSeg outputs an uncertainty score alongside the segmentation result, weighting and fusing multiple gaze estimates for robustness, which proves effective especially under motion blur, eyelid occlusion, and cross-domain challenges. Moreover, empirical results show that EyeSeg achieves improvements in MIoU, E1, F1, and ACC, surpassing previous approaches.
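The "weighting and fusing multiple gaze estimates" step can be pictured as confidence-weighted averaging. The exponential weighting below is an illustrative choice, not the paper's exact rule:

```python
import numpy as np

def fuse_gaze_estimates(gazes, uncertainties, temperature=1.0):
    """Fuse candidate gaze vectors, down-weighting uncertain estimates.

    gazes: (K, 3) candidate gaze direction vectors.
    uncertainties: (K,) per-estimate uncertainty scores (higher = worse).
    The softmax-of-negative-uncertainty weighting is an assumption.
    """
    w = np.exp(-np.asarray(uncertainties) / temperature)
    w /= w.sum()
    fused = (w[:, None] * np.asarray(gazes)).sum(axis=0)
    return fused / np.linalg.norm(fused)   # re-normalize the direction
```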

NeurIPS Conference 2025 Conference Paper

Guard Me If You Know Me: Protecting Specific Face-Identity from Deepfakes

  • Kaiqing Lin
  • Zhiyuan Yan
  • Ke-Yue Zhang
  • Li Hao
  • Yue Zhou
  • Yuzhen Lin
  • Weixiang Li
  • Taiping Yao

Securing personal identity against deepfake attacks is increasingly critical in the digital age, especially for celebrities and political figures whose faces are easily accessible and frequently targeted. Most existing deepfake detection methods focus on general-purpose scenarios and often ignore the valuable prior knowledge of known facial identities, e.g., "VIP individuals" whose authentic facial data are already available. In this paper, we propose VIPGuard, a unified multimodal framework designed to capture fine-grained and comprehensive facial representations of a given identity, compare them against potentially fake or similar-looking faces, and reason over these comparisons to make accurate and explainable predictions. Specifically, our framework consists of three main stages. First, we fine-tune a multimodal large language model (MLLM) to learn detailed and structural facial attributes. Second, we perform identity-level discriminative learning to enable the model to distinguish subtle differences between highly similar faces, including real and fake variations. Finally, we introduce user-specific customization, where we model the unique characteristics of the target face identity and perform semantic reasoning via the MLLM to enable personalized and explainable deepfake detection. Our framework shows clear advantages over previous detection works, where traditional detectors mainly rely on low-level visual cues and provide no human-understandable explanations, while other MLLM-based models often lack a detailed understanding of specific face identities. To facilitate the evaluation of our method, we build a comprehensive identity-aware benchmark called VIPBench for personalized deepfake detection, involving 7 of the latest face-swapping and 7 entire-face-synthesis techniques for generation. Extensive experiments show that our model outperforms existing methods in both detection and explanation. The code is available at https://github.com/KQL11/VIPGuard.

AAAI Conference 2025 Conference Paper

Instruct Where the Model Fails: Generative Data Augmentation via Guided Self-contrastive Fine-tuning

  • Weijian Ma
  • Ruoxin Chen
  • Keyue Zhang
  • Shuang Wu
  • Shouhong Ding

Data augmentation is expected to bring about unseen features of the training set, enhancing the model's ability to generalize in situations where data is limited. Generative image models trained on large web-crawled datasets such as LAION are known to produce images with stereotypes and imperceptible bias when used to augment training data, owing to dataset misalignment and the generator's ignorance of the downstream model. We improve downstream-task awareness in generated images by proposing a task-aware fine-tuning strategy that actively detects failures of the downstream task in the target model to fine-tune the generation process between epochs. The dynamic fine-tuning strategy works by (1) inspecting misalignment between generated data and original data via VLM captioners and (2) adjusting both the prompts and the diffusion model, so that the strategy dynamically guides the generator by focusing on the bias detected by the VLM. This is done by re-captioning the overfitted data as well as fine-tuning the diffusion trajectory in a contrastive manner. To cooperate with the VLM captioner, the contrastive fine-tuning process dynamically adjusts different parts of the diffusion trajectory based on the detected misalignment, thus shifting the generated distribution away from making the downstream model overfit. Our experiments on few-shot class-incremental learning show that our instruction-guided fine-tuning strategy consistently assists the downstream model with higher classification accuracy compared to generative data augmentation baselines such as Stable Diffusion and GPT-4o, and state-of-the-art non-generative strategies.

ICML Conference 2025 Conference Paper

Large Continual Instruction Assistant

  • Jingyang Qiao
  • Zhizhong Zhang 0001
  • Xin Tan 0002
  • Yanyun Qu
  • Shouhong Ding
  • Yuan Xie 0006

Continual Instruction Tuning (CIT) is adopted to continually instruct Large Models to follow human intent, dataset by dataset. It is observed that existing gradient updates heavily degrade performance on previous datasets during the CIT process. By contrast, Exponential Moving Average (EMA) can trace previous parameters, which helps decrease forgetting. Nonetheless, its fixed balance weight fails to cope with ever-changing datasets, upsetting the balance between plasticity and stability. In this paper, we propose a general continual instruction tuning framework to address this challenge. Starting from the trade-off prerequisite and the EMA update, we derive an ideal condition for plasticity and stability. Based on a Taylor expansion of the loss function, we find that the optimal balance weight can be automatically determined by the gradients and learned parameters. Therefore, we propose a stable-plasticity balanced coefficient to avoid knowledge interference. Based on the semantic similarity of the instructions, we can determine whether to retrain or expand the training parameters and allocate the most suitable parameters to the testing instances. Extensive experiments across multiple continual instruction tuning benchmarks demonstrate that our approach not only enhances anti-forgetting capabilities but also significantly improves overall continual tuning performance. Our code is available at https://github.com/JingyangQiao/CoIN.
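The core idea, an EMA whose balance weight adapts per step rather than staying fixed, might be sketched as follows. The closed form for beta below (update size versus parameter drift) is a simplified stand-in for the paper's Taylor-derived coefficient, shown only to make the mechanism concrete.

```python
import torch

@torch.no_grad()
def adaptive_ema_update(ema_params, params, grads, lr, eps=1e-8):
    """EMA of model parameters with a per-step balance weight.

    ema_params/params/grads: matching lists of tensors; lr: learning rate.
    The formula for beta is an assumption, not the paper's exact rule.
    """
    for e, p, g in zip(ema_params, params, grads):
        drift = (p - e).norm() + eps          # how far the EMA lags the model
        step = (lr * g).norm() + eps          # magnitude of the current update
        beta = drift / (drift + step)         # bigger updates -> smaller beta
        e.mul_(beta).add_(p, alpha=1.0 - beta)
```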

ICML Conference 2025 Conference Paper

Orthogonal Subspace Decomposition for Generalizable AI-Generated Image Detection

  • Zhiyuan Yan 0002
  • Jiangming Wang
  • Peng Jin 0001
  • Ke-Yue Zhang
  • Chengchun Liu
  • Shen Chen 0004
  • Taiping Yao
  • Shouhong Ding

Detecting AI-generated images (AIGIs), such as natural images or face images, has become increasingly important yet challenging. In this paper, we start from a new perspective to excavate the reason behind the failed generalization in AIGI detection, named the asymmetry phenomenon: a naively trained detector tends to overfit to the limited and monotonous fake patterns, causing the feature space to become highly constrained and low-ranked, which is shown to seriously limit expressivity and generalization. One potential remedy is incorporating the pre-trained knowledge within vision foundation models (higher-ranked) to expand the feature space, alleviating the model's overfitting to fake patterns. To this end, we employ Singular Value Decomposition (SVD) to decompose the original feature space into two orthogonal subspaces. By freezing the principal components and adapting only the remaining components, we preserve the pre-trained knowledge while learning fake patterns. Compared to existing full-parameter and LoRA-based tuning methods, we explicitly ensure orthogonality, enabling a higher rank of the whole feature space, effectively minimizing overfitting and enhancing generalization. We finally identify a crucial insight: our method implicitly learns a vital prior that fakes are actually derived from the real, indicating a hierarchical relationship rather than independence. Modeling this prior, we believe, is essential for achieving superior generalization. Our code is publicly available at https://github.com/YZY-stack/Effort-AIGI-Detection.
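A minimal sketch of the SVD split described above, assuming a linear layer: the top-k singular directions are frozen and only the residual factors are trained. The rank split and the absence of re-orthogonalization during training are simplifications, not the released implementation.

```python
import torch
import torch.nn as nn

class OrthogonalSubspaceLinear(nn.Module):
    """Freeze the principal SVD subspace of a pre-trained weight; train only
    the residual subspace (a sketch of the orthogonal-decomposition idea).
    """

    def __init__(self, weight: torch.Tensor, k: int):
        super().__init__()
        U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
        # Principal subspace: frozen buffer carrying the pre-trained knowledge.
        self.register_buffer("w_main", U[:, :k] @ torch.diag(S[:k]) @ Vh[:k])
        # Residual subspace: trainable factors that learn the fake patterns.
        self.u = nn.Parameter(U[:, k:].clone())
        self.s = nn.Parameter(S[k:].clone())
        self.vh = nn.Parameter(Vh[k:].clone())

    def forward(self, x):
        w = self.w_main + self.u @ torch.diag(self.s) @ self.vh
        return x @ w.T
```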

ICML Conference 2025 Conference Paper

PiD: Generalized AI-Generated Images Detection with Pixelwise Decomposition Residuals

  • Xinghe Fu
  • Zhiyuan Yan 0002
  • Zheng Yang
  • Taiping Yao
  • Yandan Zhao
  • Shouhong Ding
  • Xi Li 0001

Fake images, created by recently advanced generative models, have become increasingly indistinguishable from real ones, making their detection crucial, urgent, and challenging. This paper introduces PiD (Pixelwise Decomposition Residuals), a novel detection method that focuses on residual signals within images. Generative models are designed to optimize high-level semantic content (principal components), often overlooking low-level signals (residual components). PiD leverages this observation by disentangling residual components from images, encouraging the model to uncover more underlying and general forgery clues independent of semantic content. Compared to prior approaches that rely on reconstruction techniques or high-frequency information, PiD is computationally efficient and does not rely on any generative models for reconstruction. Specifically, PiD operates at the pixel level, mapping the pixel vector to another color space (e.g., YUV) and then quantizing the vector. The pixel vector is mapped back to the RGB space and the quantization loss is taken as the residual for AIGC detection. Our experimental results are striking and highly surprising: PiD achieves 98% accuracy on the widely used GenImage benchmark, highlighting its effectiveness and generalization performance.
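The pixelwise round trip is concrete enough to sketch directly. The BT.601 conversion matrix and the 32-level uniform quantizer below are illustrative choices, since the abstract leaves them open:

```python
import numpy as np

# RGB <-> YUV conversion (BT.601 coefficients); an assumed concrete choice.
RGB2YUV = np.array([[0.299, 0.587, 0.114],
                    [-0.14713, -0.28886, 0.436],
                    [0.615, -0.51499, -0.10001]])
YUV2RGB = np.linalg.inv(RGB2YUV)

def pid_residual(img, levels=32):
    """Pixelwise decomposition residual: RGB -> YUV -> quantize -> RGB,
    then keep the round-trip quantization error as the detection input.

    img: (H, W, 3) float array in [0, 1]; levels is a placeholder setting.
    """
    yuv = img @ RGB2YUV.T
    quantized = np.round(yuv * levels) / levels   # uniform quantization in YUV
    roundtrip = quantized @ YUV2RGB.T
    return img - roundtrip                        # low-level residual signal
```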

AAAI Conference 2025 Conference Paper

PVTree: Realistic and Controllable Palm Vein Generation for Recognition Tasks

  • Sheng Shang
  • Chenglong Zhao
  • Ruixin Zhang
  • Jianlong Jin
  • Jingyun Zhang
  • Rizen Guo
  • Shouhong Ding
  • Yunsheng Wu

Palm vein recognition is an emerging biometric technology that offers enhanced security and privacy. However, acquiring sufficient palm vein data for training deep learning-based recognition models is challenging due to the high costs of data collection and privacy protection constraints. This has led to a growing interest in generating pseudo-palm vein data using generative models. Existing methods, however, often produce unrealistic palm vein patterns or struggle with controlling identity and style attributes. To address these issues, we propose a novel palm vein generation framework named PVTree. First, the palm vein identity is defined by a complex and authentic 3D palm vascular tree, created using an improved Constrained Constructive Optimization (CCO) algorithm. Second, palm vein patterns of the same identity are generated by projecting the same 3D vascular tree into 2D images from different views and converting them into realistic images using a generative model. As a result, PVTree satisfies the need for both identity consistency and intra-class diversity. Extensive experiments conducted on several publicly available datasets demonstrate that our proposed palm vein generation method surpasses existing methods and achieves a higher TAR@FAR=1e-4 under the 1:1 Open-set protocol. To the best of our knowledge, this is the first time that the performance of a recognition model trained on synthetic palm vein data exceeds that of the recognition model trained on real data, which indicates that palm vein image generation research has a promising future.

AAAI Conference 2025 Conference Paper

SlerpFace: Face Template Protection via Spherical Linear Interpolation

  • Zhizhou Zhong
  • Yuxi Mi
  • Yuge Huang
  • Jianqing Xu
  • Guodong Mu
  • Shouhong Ding
  • Jingyun Zhang
  • Rizen Guo

Contemporary face recognition systems use feature templates extracted from face images to identify persons. To enhance privacy, face template protection techniques are widely employed to conceal sensitive identity and appearance information stored in the template. This paper identifies an emerging privacy attack form utilizing diffusion models that could nullify prior protection. The attack can synthesize high-quality, identity-preserving face images from templates, revealing persons' appearance. Based on studies of the diffusion model's generative capability, this paper proposes a defense by rotating templates to a noise-like distribution. This is achieved efficiently by spherically and linearly interpolating templates on their located hypersphere. This paper further proposes to divide templates' feature dimensions into groups and drop them out group-wise, to enhance the irreversibility of rotated templates. The proposed techniques are concretized as a novel face template protection technique, SlerpFace. Extensive experiments show that SlerpFace provides satisfactory recognition accuracy and comprehensive protection against inversion and other attack forms, superior to prior arts.
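Spherical linear interpolation toward a random unit vector, followed by group-wise dropout, can be sketched as below. The interpolation weight, the group count, and the keep ratio are placeholders, not the paper's settings.

```python
import numpy as np

def slerp_protect(template, alpha=0.7, num_groups=4, keep_groups=3, seed=0):
    """Rotate a unit-norm template toward noise via slerp, then zero out
    entire feature-dimension groups (a sketch; parameters are assumptions).

    template: (D,) unit-norm face feature.
    """
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(template.shape)
    noise /= np.linalg.norm(noise)
    omega = np.arccos(np.clip(template @ noise, -1.0, 1.0))
    rotated = (np.sin((1 - alpha) * omega) * template
               + np.sin(alpha * omega) * noise) / np.sin(omega)
    # Group-wise dropout: keep only some dimension groups for irreversibility.
    groups = np.array_split(np.arange(template.size), num_groups)
    mask = np.zeros(template.size)
    for g in rng.choice(num_groups, size=keep_groups, replace=False):
        mask[groups[g]] = 1.0
    return rotated * mask
```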

NeurIPS Conference 2025 Conference Paper

Switchable Token-Specific Codebook Quantization For Face Image Compression

  • Yongbo Wang
  • Haonan Wang
  • Guodong Mu
  • Ruixin Zhang
  • Jiaqi Chen
  • Jingyun Zhang
  • Jun Wang
  • Yuan Xie

With the ever-increasing volume of visual data, efficient and lossless transmission, along with subsequent interpretation and understanding, has become a critical bottleneck in modern information systems. Recently emerged codebook-based solutions utilize a globally shared codebook to quantize and dequantize each token, controlling the bpp by adjusting the number of tokens or the codebook size. However, for facial images, which are rich in attributes, such global codebook strategies overlook both the category-specific correlations within images and the semantic differences among tokens, resulting in suboptimal performance, especially at low bpp. Motivated by these observations, we propose a Switchable Token-Specific Codebook Quantization for face image compression, which learns distinct codebook groups for different image categories and assigns an independent codebook to each token. By recording the codebook group to which each token belongs with a small number of bits, our method can reduce the loss incurred when decreasing the size of each codebook group. This enables a larger total number of codebooks under a lower overall bpp, thereby enhancing the expressive capability and improving reconstruction performance. Owing to its generalizable design, our method can be integrated into any existing codebook-based representation learning approach and has demonstrated its effectiveness on face recognition datasets, achieving an average accuracy of 93.51% for reconstructed images at 0.05 bpp.
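A sketch of token-specific quantization with switchable groups is given below; the tensor shapes and the brute-force nearest-neighbor search are illustrative, not the paper's implementation.

```python
import numpy as np

def quantize_tokens(tokens, codebooks, group_ids):
    """Quantize each token with its own codebook from a selected group.

    tokens:    (T, D) token features.
    codebooks: (G, T, K, D) array: G switchable groups, one K-entry codebook
               per token position (sizes are assumptions).
    group_ids: (T,) chosen group per token; transmitting these indices costs
               only about log2(G) extra bits per token.
    """
    T = tokens.shape[0]
    codes, recon = np.empty(T, dtype=int), np.empty_like(tokens)
    for t in range(T):
        cb = codebooks[group_ids[t], t]                     # (K, D) codebook
        codes[t] = np.argmin(((cb - tokens[t]) ** 2).sum(axis=1))
        recon[t] = cb[codes[t]]                             # dequantized token
    return codes, recon
```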

ICLR Conference 2025 Conference Paper

ToVE: Efficient Vision-Language Learning via Knowledge Transfer from Vision Experts

  • Yuanchen Wu
  • Junlong Du
  • Ke Yan
  • Shouhong Ding
  • Xiaoqiang Li 0002

Vision-language (VL) learning requires extensive visual perception capabilities, such as fine-grained object recognition and spatial perception. Recent works typically rely on training huge models on massive datasets to develop these capabilities. As a more efficient alternative, this paper proposes a new framework that Transfers the knowledge from a hub of Vision Experts (ToVE) for efficient VL learning, leveraging pre-trained vision expert models to promote visual perception capability. Specifically, building on a frozen CLIP image encoder that provides vision tokens for image-conditioned language generation, ToVE introduces a hub of multiple vision experts and a token-aware gating network that dynamically routes expert knowledge to vision tokens. In the transfer phase, we propose a "residual knowledge transfer" strategy, which not only preserves the generalizability of the vision tokens but also allows selective detachment of low-contributing experts to improve inference efficiency. Further, we explore merging this expert knowledge into a single CLIP encoder, creating a knowledge-merged CLIP that produces more informative vision tokens without expert inference during deployment. Experimental results across various VL tasks demonstrate that the proposed ToVE achieves competitive performance with two orders of magnitude less training data.
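The token-aware gating can be pictured as a per-token softmax over experts feeding a residual sum. The linear gate and projections below are assumptions for illustration, not ToVE's exact modules.

```python
import torch
import torch.nn as nn

class TokenAwareGate(nn.Module):
    """Per-token routing of expert knowledge into vision tokens (a sketch)."""

    def __init__(self, dim, num_experts):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)
        self.proj = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])

    def forward(self, tokens, expert_feats):
        # tokens: (B, N, D); expert_feats: list of per-expert (B, N, D) maps.
        weights = self.gate(tokens).softmax(dim=-1)          # (B, N, E)
        mixed = sum(weights[..., e:e + 1] * self.proj[e](feat)
                    for e, feat in enumerate(expert_feats))
        return tokens + mixed          # residual-style knowledge transfer
```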

ICML Conference 2025 Conference Paper

Towards Rationale-Answer Alignment of LVLMs via Self-Rationale Calibration

  • Yuanchen Wu
  • Ke Yan
  • Shouhong Ding
  • Ziyin Zhou
  • Xiaoqiang Li 0002

Large Vision-Language Models (LVLMs) have manifested strong visual question answering capability. However, they still struggle with aligning the rationale and the generated answer, leading to inconsistent reasoning and incorrect responses. To this end, this paper introduces the Self-Rationale Calibration (SRC) framework to iteratively calibrate the alignment between rationales and answers. SRC begins by employing a lightweight “rationale fine-tuning” approach, which modifies the model’s response format to require a rationale before deriving the answer, without explicit prompts. Next, SRC searches a diverse set of candidate responses from the fine-tuned LVLMs for each sample, followed by a proposed pairwise scoring strategy using a tailored scoring model, R-Scorer, to evaluate both the rationale quality and the factual consistency of candidates. Based on a confidence-weighted preference curation process, SRC decouples the alignment calibration into a preference fine-tuning manner, leading to significant improvements of LVLMs in perception, reasoning, and generalization across multiple benchmarks. Our results emphasize the role of rationale-oriented alignment in exploring the potential of LVLMs.

ICLR Conference 2025 Conference Paper

UIFace: Unleashing Inherent Model Capabilities to Enhance Intra-Class Diversity in Synthetic Face Recognition

  • Xiao Lin
  • Yuge Huang
  • Jianqing Xu
  • Yuxi Mi
  • Shuigeng Zhou
  • Shouhong Ding

Face recognition (FR) stands as one of the most crucial applications in computer vision. The accuracy of FR models has significantly improved in recent years due to the availability of large-scale human face datasets. However, directly using these datasets can inevitably lead to privacy and legal problems. Generating synthetic data to train FR models is a feasible solution to circumvent these issues. While existing synthetic-based face recognition methods have made significant progress in generating identity-preserving images, they are severely plagued by context overfitting, resulting in a lack of intra-class diversity of generated images and poor face recognition performance. In this paper, we propose a framework to $\textbf{U}$nleash model $\textbf{I}$nherent capabilities to enhance intra-class diversity for synthetic face recognition, abbreviated as $\textbf{UIFace}$. Our framework first trains a diffusion model that can perform denoising conditioned on either identity contexts or a learnable empty context. The former generates identity-preserving images but lacks variations, while the latter exploits the model's intrinsic ability to synthesize intra-class-diversified images but with random identities. Then we adopt a novel two-stage denoising strategy to fully leverage the strengths of both types of contexts, resulting in images that are diverse as well as identity-preserving. Moreover, an attention injection module is introduced to further augment the intra-class variations by utilizing attention maps from the empty context to guide the denoising process in ID-conditioned generation. Experiments show that our method significantly surpasses previous approaches with even less training data and half the size of the synthetic dataset. More surprisingly, the proposed $\textbf{UIFace}$ even achieves comparable performance to FR models trained on real datasets when we increase the number of synthetic identities.

NeurIPS Conference 2024 Conference Paper

$\text{ID}^3$: Identity-Preserving-yet-Diversified Diffusion Models for Synthetic Face Recognition

  • Jianqing Xu
  • Shen Li
  • Jiaying Wu
  • Miao Xiong
  • Ailin Deng
  • Jiazhen Ji
  • Yuge Huang
  • Guodong Mu

Synthetic face recognition (SFR) aims to generate synthetic face datasets that mimic the distribution of real face data, which allows for training face recognition models in a privacy-preserving manner. Despite the remarkable potential of diffusion models in image generation, current diffusion-based SFR models struggle with generalization to real-world faces. To address this limitation, we outline three key objectives for SFR: (1) promoting diversity across identities (inter-class diversity), (2) ensuring diversity within each identity by injecting various facial attributes (intra-class diversity), and (3) maintaining identity consistency within each identity group (intra-class identity preservation). Inspired by these goals, we introduce a diffusion-fueled SFR model termed $\text{ID}^3$. $\text{ID}^3$ employs an ID-preserving loss to generate diverse yet identity-consistent facial appearances. Theoretically, we show that minimizing this loss is equivalent to maximizing the lower bound of an adjusted conditional log-likelihood over ID-preserving data. This equivalence motivates an ID-preserving sampling algorithm, which operates over an adjusted gradient vector field, enabling the generation of fake face recognition datasets that approximate the distribution of real-world faces. Extensive experiments across five challenging benchmarks validate the advantages of $\text{ID}^3$.

NeurIPS Conference 2024 Conference Paper

DF40: Toward Next-Generation Deepfake Detection

  • Zhiyuan Yan
  • Taiping Yao
  • Shen Chen
  • Yandan Zhao
  • Xinghe Fu
  • Junwei Zhu
  • Donghao Luo
  • Chengjie Wang

We propose a new comprehensive benchmark to revolutionize the current deepfake detection field toward the next generation. Predominantly, existing works identify top-notch detection algorithms and models by adhering to the common practice: training detectors on one specific dataset (e.g., FF++) and testing them on other prevalent deepfake datasets. This protocol is often regarded as a "golden compass" for navigating SoTA detectors. But can these stand-out "winners" be truly applied to tackle the myriad of realistic and diverse deepfakes lurking in the real world? If not, what underlying factors contribute to this gap? In this work, we found the dataset (both train and test) can be the "primary culprit" due to the following: (1) forgery diversity: deepfake techniques commonly refer to both face forgery (face-swapping and face-reenactment) and entire image synthesis (AIGC, especially face). Most existing datasets only contain partial types of them, with limited forgery methods implemented (e.g., 2 swapping and 2 reenactment methods in FF++); (2) forgery realism: the dominant training dataset, FF++, contains out-of-date forgery techniques from the past four years. "Honing skills" on these forgeries makes it difficult to guarantee effective detection generalization toward today's SoTA deepfakes; (3) evaluation protocol: most detection works perform evaluations on one type, e.g., face-swapping only, which hinders the development of universal deepfake detectors. To address this dilemma, we construct a highly diverse and large-scale deepfake detection dataset called DF40, which comprises 40 distinct deepfake techniques (10 times larger than FF++). We then conduct comprehensive evaluations using 4 standard evaluation protocols and 8 representative detection methods, resulting in over 2,000 evaluations. Through these evaluations, we provide an extensive analysis from various perspectives, leading to 7 new insightful findings contributing to the field. We also open up 4 valuable yet previously underexplored research questions to inspire future works. We release our dataset, code, and pre-trained weights at https://github.com/YZY-stack/DF40.

NeurIPS Conference 2024 Conference Paper

DiffusionFake: Enhancing Generalization in Deepfake Detection via Guided Stable Diffusion

  • Ke Sun
  • Shen Chen
  • Taiping Yao
  • Hong Liu
  • Xiaoshuai Sun
  • Shouhong Ding
  • Rongrong Ji

The rapid progress of Deepfake technology has made face swapping highly realistic, raising concerns about the malicious use of fabricated facial content. Existing methods often struggle to generalize to unseen domains due to the diverse nature of facial manipulations. In this paper, we revisit the generation process and identify a universal principle: Deepfake images inherently contain information from both source and target identities, while genuine faces maintain a consistent identity. Building upon this insight, we introduce DiffusionFake, a novel plug-and-play framework that reverses the generative process of face forgeries to enhance the generalization of detection models. DiffusionFake achieves this by injecting the features extracted by the detection model into a frozen pre-trained Stable Diffusion model, compelling it to reconstruct the corresponding target and source images. This guided reconstruction process constrains the detection network to capture the source- and target-related features that facilitate the reconstruction, thereby learning rich and disentangled representations that are more resilient to unseen forgeries. Extensive experiments demonstrate that DiffusionFake significantly improves cross-domain generalization of various detector architectures without introducing additional parameters during inference. The code is available at https://github.com/skJack/DiffusionFake.git.

AAAI Conference 2024 Conference Paper

Domain-Hallucinated Updating for Multi-Domain Face Anti-spoofing

  • Chengyang Hu
  • Ke-Yue Zhang
  • Taiping Yao
  • Shice Liu
  • Shouhong Ding
  • Xin Tan
  • Lizhuang Ma

Multi-Domain Face Anti-Spoofing (MD-FAS) is a practical setting that aims to update models on new domains using only novel data while ensuring that the knowledge acquired from previous domains is not forgotten. Prior methods utilize the responses from models to represent the previous domain knowledge or map the different domains into separated feature spaces to prevent forgetting. However, due to domain gaps, the responses of new data are not as accurate as those of previous data. Also, without the supervision of previous data, separated feature spaces might be destroyed by new domains while updating, leading to catastrophic forgetting. Inspired by the challenges posed by the lack of previous data, we solve this issue from a new standpoint: generating hallucinated previous data for updating the FAS model. To this end, we propose a novel Domain-Hallucinated Updating (DHU) framework to facilitate the hallucination of data. Specifically, a Domain Information Explorer learns representative domain information of the previous domains. Then, a Domain Information Hallucination module transfers the new domain data into pseudo-previous-domain ones. Moreover, a Hallucinated Features Joint Learning module is proposed to asymmetrically align the new and pseudo-previous data for real samples via dual levels to learn more generalized features, improving the results on all domains. Our experimental results and visualizations demonstrate that the proposed method outperforms state-of-the-art competitors in terms of effectiveness.

AAAI Conference 2024 Conference Paper

HDMixer: Hierarchical Dependency with Extendable Patch for Multivariate Time Series Forecasting

  • Qihe Huang
  • Lei Shen
  • Ruixin Zhang
  • Jiahuan Cheng
  • Shouhong Ding
  • Zhengyang Zhou
  • Yang Wang

Multivariate time series (MTS) prediction has been widely adopted in various scenarios. Recently, some methods have employed patching to enhance local semantics and improve model performance. However, fixed-length patches are prone to losing temporal boundary information, such as complete peaks and periods. Moreover, existing methods mainly focus on modeling long-term dependencies across patches, while paying little attention to other dimensions (e.g., short-term dependencies within patches and complex interactions among cross-variable patches). To address these challenges, we propose a pure MLP-based HDMixer, aiming to acquire patches with richer semantic information and to efficiently model hierarchical interactions. Specifically, we design a Length-Extendable Patcher (LEP) tailored to MTS, which enriches the boundary information of patches and alleviates semantic incoherence in series. Subsequently, we devise a Hierarchical Dependency Explorer (HDE) based on pure MLPs. This explorer effectively models short-term dependencies within patches, long-term dependencies across patches, and complex interactions among variables. Extensive experiments on 9 real-world datasets demonstrate the superiority of our approach. The code is available at https://github.com/hqh0728/HDMixer.

AAAI Conference 2024 Conference Paper

MmAP: Multi-Modal Alignment Prompt for Cross-Domain Multi-Task Learning

  • Yi Xin
  • Junlong Du
  • Qiang Wang
  • Ke Yan
  • Shouhong Ding

Multi-Task Learning (MTL) is designed to train multiple correlated tasks simultaneously, thereby enhancing the performance of individual tasks. Typically, a multi-task network structure consists of a shared backbone and task-specific decoders. However, the complexity of the decoders increases with the number of tasks. To tackle this challenge, we integrate the decoder-free vision-language model CLIP, which exhibits robust zero-shot generalization capability. Recently, parameter-efficient transfer learning methods have been extensively explored with CLIP for adapting to downstream tasks, where prompt tuning showcases strong potential. Nevertheless, these methods solely fine-tune a single modality (text or visual), disrupting the modality structure of CLIP. In this paper, we first propose Multi-modal Alignment Prompt (MmAP) for CLIP, which aligns text and visual modalities during the fine-tuning process. Building upon MmAP, we develop an innovative multi-task prompt learning framework. On the one hand, to maximize the complementarity of tasks with high similarity, we utilize a gradient-driven task grouping method that partitions tasks into several disjoint groups and assigns a group-shared MmAP to each group. On the other hand, to preserve the unique characteristics of each task, we assign a task-specific MmAP to each task. Comprehensive experiments on two large multi-task learning datasets demonstrate that our method achieves significant performance improvements compared to full fine-tuning while only utilizing approximately 0.09% of trainable parameters.
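The gradient-driven task grouping can be approximated by clustering tasks on gradient cosine similarity. The greedy rule and the threshold below are illustrative simplifications, not the paper's exact procedure.

```python
import numpy as np

def group_tasks(task_grads, threshold=0.3):
    """Greedy grouping: tasks whose flattened gradients point in similar
    directions share one group (and would share a group-shared prompt).

    task_grads: (T, P) array, one flattened gradient vector per task.
    """
    g = task_grads / np.linalg.norm(task_grads, axis=1, keepdims=True)
    sim = g @ g.T                                  # pairwise cosine similarity
    groups, unassigned = [], list(range(len(g)))
    while unassigned:
        seed = unassigned.pop(0)                   # start a group from one task
        members = [seed] + [t for t in unassigned if sim[seed, t] > threshold]
        unassigned = [t for t in unassigned if t not in members]
        groups.append(members)
    return groups
```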

ICML Conference 2024 Conference Paper

Model Tailor: Mitigating Catastrophic Forgetting in Multi-modal Large Language Models

  • Didi Zhu
  • Zhongyi Sun 0002
  • Zexi Li 0001
  • Tao Shen 0002
  • Ke Yan
  • Shouhong Ding
  • Chao Wu 0001
  • Kun Kuang 0001

Catastrophic forgetting emerges as a critical challenge when fine-tuning multi-modal large language models (MLLMs), where improving performance on unseen tasks often leads to a significant performance drop on the original tasks. This paper presents a comprehensive analysis of catastrophic forgetting in MLLMs and introduces a post-training adjustment method called Model Tailor. Our method primarily preserves the pre-trained parameters while replacing a small number ($\leq$ 10%) of fine-tuned parameters, maintaining $\sim$ 99% effectiveness on original tasks versus pre-training, and achieving $\sim$ 97% on new tasks compared to standard fine-tuning. Specifically, we derive a sparse mask to identify the model patch, based on a fusion strategy that integrates salience and sensitivity analysis. Subsequently, a compensation mechanism is introduced to decorate the patch, enhancing the model's performance on both target and original tasks. Additionally, our method is adaptable to multi-task scenarios. Through extensive experiments on InstructBLIP and LLaVA-1.5 in both image captioning and visual question answering tasks, our approach demonstrates significant task adaptability while preserving inherent pre-trained capabilities.
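A sketch of the patch selection, fusing a salience term (weight change) and a sensitivity term (gradient magnitude): the linear fusion rule and the max-normalization below are stand-ins for the paper's exact strategy.

```python
import numpy as np

def select_model_patch(pretrained, finetuned, grads, ratio=0.1, lam=0.5):
    """Keep at most `ratio` of the fine-tuned parameters, chosen by a fused
    salience/sensitivity score; revert the rest to the pre-trained values.

    pretrained, finetuned, grads: flat arrays of equal length (assumption:
    gradients are taken on the original tasks).
    """
    delta = np.abs(finetuned - pretrained)                 # salience: weight shift
    score = (lam * delta / (delta.max() + 1e-12)
             + (1 - lam) * np.abs(grads) / (np.abs(grads).max() + 1e-12))
    k = int(ratio * score.size)
    mask = np.zeros(score.size, dtype=bool)
    mask[np.argsort(-score)[:k]] = True                    # top-k fused scores
    return np.where(mask, finetuned, pretrained), mask     # stitched parameters
```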

AAAI Conference 2024 Conference Paper

PCE-Palm: Palm Crease Energy Based Two-Stage Realistic Pseudo-Palmprint Generation

  • Jianlong Jin
  • Lei Shen
  • Ruixin Zhang
  • Chenglong Zhao
  • Ge Jin
  • Jingyun Zhang
  • Shouhong Ding
  • Yang Zhao

The lack of large-scale data seriously hinders the development of palmprint recognition. Recent approaches address this issue by generating large-scale realistic pseudo palmprints from Bézier curves. However, the significant difference between Bézier curves and real palmprints limits their effectiveness. In this paper, we divide the Bézier-Real difference into creases and texture differences, thus reducing the generation difficulty. We introduce a new palm crease energy (PCE) domain as a bridge from Bézier curves to real palmprints and propose a two-stage generation model. The first stage generates PCE images (realistic creases) from Bézier curves, and the second stage outputs realistic palmprints (realistic texture) with PCE images as input. In addition, we also design a lightweight plug-and-play line feature enhancement block to facilitate domain transfer and improve recognition performance. Extensive experimental results demonstrate that the proposed method surpasses state-of-the-art methods. Under extremely few data settings like 40 IDs (only 2.5% of the total training set), our model achieves a 29% improvement over RPG-Palm and outperforms ArcFace with 100% training set by more than 6% in terms of TAR@FAR=1e-6.

NeurIPS Conference 2024 Conference Paper

SAFE: Slow and Fast Parameter-Efficient Tuning for Continual Learning with Pre-Trained Models

  • Linglan Zhao
  • Xuerui Zhang
  • Ke Yan
  • Shouhong Ding
  • Weiran Huang

Continual learning aims to incrementally acquire new concepts in data streams while resisting forgetting previous knowledge. With the rise of powerful pre-trained models (PTMs), there is a growing interest in training incremental learning systems using these foundation models, rather than learning from scratch. Existing works often view PTMs as a strong initial point and directly apply parameter-efficient tuning (PET) in the first session for adapting to downstream tasks. In the following sessions, most methods freeze model parameters for tackling forgetting issues. However, applying PET directly to downstream data cannot fully explore the inherent knowledge in PTMs. Additionally, freezing the parameters in incremental sessions hinders models' plasticity to novel concepts not covered in the first session. To solve the above issues, we propose a Slow And Fast parameter-Efficient tuning (SAFE) framework. In particular, to inherit general knowledge from foundation models, we include a transfer loss function by measuring the correlation between the PTM and the PET-applied model. After calibrating in the first session, the slow efficient tuning parameters can capture more informative features, improving generalization to incoming classes. Moreover, to further incorporate novel concepts, we strike a balance between stability and plasticity by fixing slow efficient tuning parameters and continuously updating the fast ones. Specifically, a cross-classification loss with feature alignment is proposed to circumvent catastrophic forgetting. During inference, we introduce an entropy-based aggregation strategy to dynamically utilize the complementarity in the slow and fast learners. Extensive experiments on seven benchmark datasets verify the effectiveness of our method by significantly surpassing the state-of-the-art.
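The entropy-based aggregation at inference might look like the following, where each learner's weight grows with its prediction confidence; using one minus normalized entropy as the confidence is an illustrative choice, not the paper's verified rule.

```python
import numpy as np

def entropy_aggregate(slow_logits, fast_logits):
    """Fuse slow/fast learner logits, trusting the more certain learner.

    slow_logits, fast_logits: (C,) class logits from the two learners.
    """
    def confidence(z):
        p = np.exp(z - z.max())
        p /= p.sum()
        ent = -(p * np.log(p + 1e-12)).sum() / np.log(len(p))
        return 1.0 - ent                    # 1 = fully certain, 0 = uniform
    w_slow = confidence(slow_logits)
    w_fast = confidence(fast_logits)
    total = w_slow + w_fast + 1e-12
    return (w_slow * np.asarray(slow_logits)
            + w_fast * np.asarray(fast_logits)) / total
```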

AAAI Conference 2023 Conference Paper

Attack Can Benefit: An Adversarial Approach to Recognizing Facial Expressions under Noisy Annotations

  • Jiawen Zheng
  • Bo Li
  • Shengchuan Zhang
  • Shuang Wu
  • Liujuan Cao
  • Shouhong Ding

Real-world Facial Expression Recognition (FER) datasets usually exhibit complex scenarios with coupled noisy annotations and imbalanced class distributions, which undoubtedly impede the development of FER methods. To address the aforementioned issues, in this paper we propose a novel and flexible method to spot noisy labels by leveraging adversarial attack, termed Geometry Aware Adversarial Vulnerability Estimation (GAAVE). Different from existing state-of-the-art methods of noisy label learning (NLL), our method has no reliance on additional information and is thus easy to generalize to large-scale real-world FER datasets. Besides, the combination of the Dataset Splitting module and the Subset Refactoring module mitigates the impact of class imbalance, and the Self-Annotator module facilitates the sufficient use of all training data. Extensive experiments on the RAF-DB, FERPlus, AffectNet, and CIFAR-10 datasets validate the effectiveness of our method. The consistent improvement it brings to different base methods demonstrates the flexibility of the proposed GAAVE.

NeurIPS Conference 2023 Conference Paper

Content-based Unrestricted Adversarial Attack

  • Zhaoyu Chen
  • Bo Li
  • Shuang Wu
  • Kaixun Jiang
  • Shouhong Ding
  • Wenqiang Zhang

Unrestricted adversarial attacks typically manipulate the semantic content of an image (e.g., color or texture) to create adversarial examples that are both effective and photorealistic, demonstrating their ability to deceive human perception and deep neural networks with stealth and success. However, current works usually sacrifice unrestricted degrees of freedom and subjectively select some image content to guarantee the photorealism of unrestricted adversarial examples, which limits their attack performance. To ensure the photorealism of adversarial examples and boost attack performance, we propose a novel unrestricted attack framework called Content-based Unrestricted Adversarial Attack. By leveraging a low-dimensional manifold that represents natural images, we map the images onto the manifold and optimize them along its adversarial direction. Therefore, within this framework, we implement Adversarial Content Attack (ACA) based on Stable Diffusion and can generate highly transferable unrestricted adversarial examples with various adversarial contents. Extensive experimentation and visualization demonstrate the efficacy of ACA, particularly in surpassing state-of-the-art attacks by an average of 13.3%-50.4% and 16.8%-48.0% on normally trained models and defense methods, respectively.

NeurIPS Conference 2023 Conference Paper

CrossGNN: Confronting Noisy Multivariate Time Series Via Cross Interaction Refinement

  • Qihe Huang
  • Lei Shen
  • Ruixin Zhang
  • Shouhong Ding
  • Binwu Wang
  • Zhengyang Zhou
  • Yang Wang

Recently, multivariate time series (MTS) forecasting techniques have seen rapid development and widespread application across various fields. Transformer-based and GNN-based methods have shown promising potential due to their strong ability to model the interactions of time and variables. However, by conducting a comprehensive analysis of real-world data, we observe that the temporal fluctuations and the heterogeneity between variables are not well handled by existing methods. To address the above issues, we propose CrossGNN, a linear-complexity GNN model that refines cross-scale and cross-variable interactions for MTS. To deal with unexpected noise in the time dimension, an adaptive multi-scale identifier (AMSI) is leveraged to construct multi-scale time series with reduced noise. A Cross-Scale GNN is proposed to extract the scales with clearer trends and weaker noise. A Cross-Variable GNN is proposed to utilize the homogeneity and heterogeneity between different variables. By simultaneously focusing on edges with higher saliency scores and constraining edges with lower scores, the time and space complexity (i.e., $O(L)$) of CrossGNN is linear in the input sequence length $L$. Extensive experimental results on 8 real-world MTS datasets demonstrate the effectiveness of CrossGNN compared with state-of-the-art methods.

AAAI Conference 2023 Conference Paper

Delving into the Adversarial Robustness of Federated Learning

  • Jie Zhang
  • Bo Li
  • Chen Chen
  • Lingjuan Lyu
  • Shuang Wu
  • Shouhong Ding
  • Chao Wu

In Federated Learning (FL), models are as fragile as centrally trained models against adversarial examples. However, the adversarial robustness of federated learning remains largely unexplored. This paper casts light on the challenge of adversarial robustness of federated learning. To facilitate a better understanding of the adversarial vulnerability of the existing FL methods, we conduct comprehensive robustness evaluations on various attacks and adversarial training methods. Moreover, we reveal the negative impacts induced by directly adopting adversarial training in FL, which seriously hurts the test accuracy, especially in non-IID settings. In this work, we propose a novel algorithm called Decision Boundary based Federated Adversarial Training (DBFAT), which consists of two components (local re-weighting and global regularization) to improve both accuracy and robustness of FL systems. Extensive experiments on multiple datasets demonstrate that DBFAT consistently outperforms other baselines under both IID and non-IID settings.

NeurIPS Conference 2022 Conference Paper

Adv-Attribute: Inconspicuous and Transferable Adversarial Attack on Face Recognition

  • Shuai Jia
  • Bangjie Yin
  • Taiping Yao
  • Shouhong Ding
  • Chunhua Shen
  • Xiaokang Yang
  • Chao Ma

Deep learning models have shown their vulnerability when dealing with adversarial attacks. Existing attacks almost always operate on low-level instances, such as pixels and super-pixels, and rarely exploit semantic clues. For face recognition attacks, existing methods typically generate $l_p$-norm perturbations on pixels, resulting in low attack transferability and high vulnerability to denoising defense models. In this work, instead of perturbing the low-level pixels, we propose to generate attacks by perturbing high-level semantics to improve attack transferability. Specifically, a unified flexible framework, Adversarial Attributes (Adv-Attribute), is designed to generate inconspicuous and transferable attacks on face recognition, which crafts the adversarial noise and adds it to different attributes based on the guidance of the difference in face recognition features from the target. Moreover, importance-aware attribute selection and a multi-objective optimization strategy are introduced to further ensure the balance of stealthiness and attacking strength. Extensive experiments on the FFHQ and CelebA-HQ datasets show that the proposed Adv-Attribute method achieves state-of-the-art attack success rates while maintaining better visual effects than recent attack methods.

AAAI Conference 2022 Conference Paper

Delving into the Local: Dynamic Inconsistency Learning for DeepFake Video Detection

  • Zhihao Gu
  • Yang Chen
  • Taiping Yao
  • Shouhong Ding
  • Jilin Li
  • Lizhuang Ma

The rapid development of facial manipulation techniques has aroused public concern in recent years. Existing deepfake video detection approaches attempt to capture the discriminative features between real and fake faces based on temporal modelling. However, these works impose supervision on sparsely sampled video frames and overlook the local motions among adjacent frames, which encode rich inconsistency information that can serve as an efficient indicator for DeepFake video detection. To mitigate this issue, we delve into local motion and propose a novel sampling unit named the snippet, which contains a few successive video frames, for local temporal inconsistency learning. Moreover, we elaborately design an Intra-Snippet Inconsistency Module (Intra-SIM) and an Inter-Snippet Interaction Module (Inter-SIM) to establish a dynamic inconsistency modelling framework. Specifically, the Intra-SIM applies bi-directional temporal difference operations and a learnable convolution kernel to mine the short-term motions within each snippet. The Inter-SIM is then devised to promote cross-snippet information interaction to form global representations. The Intra-SIM and Inter-SIM work in an alternating manner and can be plugged into existing 2D CNNs. Our method outperforms the state-of-the-art competitors on four popular benchmark datasets, i.e., FaceForensics++, Celeb-DF, DFDC, and WildDeepfake. Besides, extensive experiments and visualizations are presented to further illustrate its effectiveness.

NeurIPS Conference 2022 Conference Paper

DENSE: Data-Free One-Shot Federated Learning

  • Jie Zhang
  • Chen Chen
  • Bo Li
  • Lingjuan Lyu
  • Shuang Wu
  • Shouhong Ding
  • Chunhua Shen
  • Chao Wu

One-shot Federated Learning (FL) has recently emerged as a promising approach, which allows the central server to learn a model in a single communication round. Despite the low communication cost, existing one-shot FL methods are mostly impractical or face inherent limitations, e.g., a public dataset is required, clients' models are homogeneous, and additional data/model information needs to be uploaded. To overcome these issues, we propose a novel two-stage Data-freE oNe-Shot federated lEarning (DENSE) framework, which trains the global model via a data generation stage and a model distillation stage. DENSE is a practical one-shot FL method that can be applied in reality due to the following advantages: (1) DENSE requires no additional information compared with other methods (except the model parameters) to be transferred between clients and the server; (2) DENSE does not require any auxiliary dataset for training; (3) DENSE considers model heterogeneity in FL, i.e., different clients can have different model architectures. Experiments on a variety of real-world datasets demonstrate the superiority of our method. For example, DENSE outperforms the best baseline method Fed-ADI by 5.08% on the CIFAR10 dataset.

AAAI Conference 2022 Conference Paper

Dual Contrastive Learning for General Face Forgery Detection

  • Ke Sun
  • Taiping Yao
  • Shen Chen
  • Shouhong Ding
  • Jilin Li
  • Rongrong Ji

With various facial manipulation techniques arising, face forgery detection has drawn growing attention due to security concerns. Previous works always formulate face forgery detection as a classification problem based on cross-entropy loss, which emphasizes category-level differences rather than the essential discrepancies between real and fake faces, limiting model generalization in unseen domains. To address this issue, we propose a novel face forgery detection framework, named Dual Contrastive Learning (DCL), which specially constructs positive and negative paired data and performs designed contrastive learning at different granularities to learn generalized feature representation. Concretely, combined with the hard sample selection strategy, Inter-Instance Contrastive Learning (Inter-ICL) is first proposed to promote task-related discriminative features learning by especially constructing instance pairs. Moreover, to further explore the essential discrepancies, Intra-Instance Contrastive Learning (Intra-ICL) is introduced to focus on the local content inconsistencies prevalent in forged faces by constructing local-region pairs inside instances. Extensive experiments and visualizations on several datasets demonstrate the generalization of our method against state-of-the-art competitors. Our code is available at https://github.com/Tencent/TFace.git.

AAAI Conference 2022 Conference Paper

Exploiting Fine-Grained Face Forgery Clues via Progressive Enhancement Learning

  • Qiqi Gu
  • Shen Chen
  • Taiping Yao
  • Yang Chen
  • Shouhong Ding
  • Ran Yi

With the rapid development of facial forgery techniques, forgery detection has attracted more and more attention due to security concerns. Existing approaches attempt to use frequency information to mine subtle artifacts in high-quality forged faces. However, their exploitation of frequency information is coarse-grained and, more importantly, their vanilla learning process struggles to extract fine-grained forgery traces. To address this issue, we propose a progressive enhancement learning framework to exploit both RGB and fine-grained frequency clues. Specifically, we perform a fine-grained decomposition of RGB images to completely decouple the real and fake traces in the frequency space. Subsequently, we propose a progressive enhancement learning framework based on a two-branch network, combined with self-enhancement and mutual-enhancement modules. The self-enhancement module captures the traces in different input spaces based on spatial noise enhancement and channel attention. The mutual-enhancement module concurrently enhances RGB and frequency features by communicating in the shared spatial dimension. The progressive enhancement process facilitates the learning of discriminative features with fine-grained face forgery clues. Extensive experiments on several datasets show that our method outperforms the state-of-the-art face forgery detection methods.
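
For intuition on working in the frequency space, here is a rough sketch of splitting an image into low- and high-frequency components with a radial Fourier mask; the cutoff radius and this coarse two-way split are assumptions for illustration and are much simpler than the fine-grained decomposition described above.

```python
# Hypothetical low/high frequency split of an image batch via a radial mask
# in the Fourier domain. The cutoff is arbitrary; the paper's fine-grained,
# learned decomposition is not reproduced here.
import torch

def frequency_split(img: torch.Tensor, cutoff: float = 0.1):
    # img: (batch, channels, H, W)
    _, _, h, w = img.shape
    spec = torch.fft.fftshift(torch.fft.fft2(img), dim=(-2, -1))
    yy, xx = torch.meshgrid(torch.linspace(-0.5, 0.5, h),
                            torch.linspace(-0.5, 0.5, w), indexing="ij")
    low_mask = ((xx ** 2 + yy ** 2).sqrt() <= cutoff).to(spec.dtype)
    low = torch.fft.ifft2(torch.fft.ifftshift(spec * low_mask, dim=(-2, -1))).real
    return low, img - low                       # low- and high-frequency parts

low, high = frequency_split(torch.rand(1, 3, 64, 64))
print(low.shape, high.shape)
```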

AAAI Conference 2022 Conference Paper

Feature Generation and Hypothesis Verification for Reliable Face Anti-spoofing

  • Shice Liu
  • Shitao Lu
  • Hongyi Xu
  • Jing Yang
  • Shouhong Ding
  • Lizhuang Ma

Although existing face anti-spoofing (FAS) methods achieve high accuracy in intra-domain experiments, their performance drops severely in cross-domain scenarios because of poor generalization. Recently, multifarious techniques have been explored, such as domain generalization and representation disentanglement. However, the improvement is still limited by two issues: 1) It is difficult to perfectly map all faces to a shared feature space; if faces from unknown domains are not mapped to the known region of the shared feature space, inaccurate predictions will be obtained. 2) It is hard to completely account for the various spoof traces during disentanglement. In this paper, we propose a Feature Generation and Hypothesis Verification framework to alleviate these two issues. Above all, feature generation networks which generate hypotheses of real faces and known attacks are introduced for the first time in the FAS task. Subsequently, two hypothesis verification modules are applied to judge whether the input face comes from the real-face space and the real-face distribution, respectively. Furthermore, analyses of the relationship between our framework and Bayesian uncertainty estimation are given, which provide theoretical support for reliable defense in unknown domains. Experimental results show our framework achieves promising results and outperforms the state-of-the-art approaches on multiple public datasets.

ICML Conference 2022 Conference Paper

Federated Learning with Label Distribution Skew via Logits Calibration

  • Jie Zhang 0081
  • Zhiqi Li 0004
  • Bo Li 0115
  • Jianghe Xu
  • Shuang Wu 0001
  • Shouhong Ding
  • Chao Wu 0001

Traditional federated optimization methods perform poorly with heterogeneous data (i.e., they suffer accuracy reduction), especially with highly skewed data. In this paper, we investigate label distribution skew in FL, where the distribution of labels varies across clients. First, we investigate label distribution skew from a statistical view. We demonstrate both theoretically and empirically that previous methods based on softmax cross-entropy are not suitable, as they can cause local models to heavily overfit to minority classes and missing classes. Additionally, we theoretically introduce a deviation bound to measure the deviation of the gradient after the local update. Finally, we propose FedLC (Federated learning via Logits Calibration), which calibrates the logits before softmax cross-entropy according to the probability of occurrence of each class. FedLC applies a fine-grained calibrated cross-entropy loss to the local update by adding a pairwise label margin. Extensive experiments on federated datasets and real-world datasets demonstrate that FedLC leads to a more accurate global model and much improved performance. Furthermore, integrating other FL methods into our approach can further enhance the performance of the global model.
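
One plausible instantiation of calibrating logits by per-class occurrence before the cross-entropy is sketched below; the specific margin form (tau times count to the power -1/4) is an assumption for illustration and may differ from the paper's exact pairwise margin.

```python
# Hedged sketch of logit calibration before softmax cross-entropy: each class
# logit is offset by a margin derived from how often that class occurs on the
# client, so rarer classes effectively get a larger decision margin. The margin
# form used here is an assumption, not necessarily FedLC's exact formula.
import torch
import torch.nn.functional as F

def calibrated_cross_entropy(logits: torch.Tensor, targets: torch.Tensor,
                             class_counts: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    margins = tau * class_counts.float().clamp(min=1).pow(-0.25)   # rarer class -> larger offset
    return F.cross_entropy(logits - margins.unsqueeze(0), targets)

logits = torch.randn(8, 10)
targets = torch.randint(0, 10, (8,))
counts = torch.tensor([500, 5, 50, 0, 120, 3, 40, 200, 1, 10])     # local label distribution
print(calibrated_cross_entropy(logits, targets, counts))
```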

IJCAI Conference 2022 Conference Paper

Region-Aware Temporal Inconsistency Learning for DeepFake Video Detection

  • Zhihao Gu
  • Taiping Yao
  • Yang Chen
  • Ran Yi
  • Shouhong Ding
  • Lizhuang Ma

The rapid development of face forgery techniques has drawn growing attention due to security concerns. Existing deepfake video detection methods always attempt to capture the discriminative features by directly exploiting static temporal convolution to mine temporal inconsistency, without explicitly exploring the diverse temporal dynamics of different forged regions. To effectively and comprehensively capture these various inconsistencies, in this paper we propose a novel Region-Aware Temporal Filter (RATF) module which automatically generates corresponding temporal filters for different spatial regions. Specifically, we decouple the dynamic temporal kernel into a set of region-agnostic basic filters and region-sensitive aggregation weights. Different weights then guide the corresponding regions to adaptively learn temporal inconsistency, which greatly enhances the overall representational ability. Moreover, to cover long-term temporal dynamics, we divide the video into multiple snippets and propose a Cross-Snippet Attention (CSA) mechanism to promote cross-snippet information interaction. Extensive experiments and visualizations on several benchmarks demonstrate the effectiveness of our method against state-of-the-art competitors.
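
A minimal sketch of generating region-specific temporal filters as weighted combinations of a shared bank of basic filters follows; the shapes, the pooling used to predict aggregation weights, and the module name are assumptions for illustration rather than the paper's implementation.

```python
# Hypothetical region-aware temporal filtering: a small bank of shared
# (region-agnostic) 1D filters is mixed by region-sensitive weights predicted
# from each region's features, and the mixed kernel is applied along time.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionAwareTemporalFilter(nn.Module):
    def __init__(self, channels: int, n_basic: int = 4, kernel_size: int = 3):
        super().__init__()
        self.basic_filters = nn.Parameter(torch.randn(n_basic, kernel_size))
        self.weight_head = nn.Linear(channels, n_basic)   # aggregation weights per region

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, regions, time, channels)
        b, r, t, c = x.shape
        w = F.softmax(self.weight_head(x.mean(dim=2)), dim=-1)        # (b, r, n_basic)
        kernels = torch.einsum("brn,nk->brk", w, self.basic_filters)  # per-region kernels
        k = kernels.shape[-1]
        x_ = x.permute(0, 1, 3, 2).reshape(1, b * r * c, t)           # grouped-conv layout
        k_ = kernels.unsqueeze(2).expand(b, r, c, k).reshape(b * r * c, 1, k)
        out = F.conv1d(x_, k_, padding=k // 2, groups=b * r * c)
        return out.reshape(b, r, c, t).permute(0, 1, 3, 2)

clip = torch.randn(2, 8, 16, 32)   # 2 videos, 8 regions, 16 frames, 32-dim features
print(RegionAwareTemporalFilter(32)(clip).shape)   # torch.Size([2, 8, 16, 32])
```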

IJCAI Conference 2021 Conference Paper

Adv-Makeup: A New Imperceptible and Transferable Attack on Face Recognition

  • Bangjie Yin
  • Wenxuan Wang
  • Taiping Yao
  • Junfeng Guo
  • Zelun Kong
  • Shouhong Ding
  • Jilin Li
  • Cong Liu

Deep neural networks, particularly face recognition models, have been shown to be vulnerable to both digital and physical adversarial examples. However, existing adversarial examples against face recognition systems either lack transferability to black-box models or fail to be implemented in practice. In this paper, we propose a unified adversarial face generation method, Adv-Makeup, which can realize imperceptible and transferable attacks under the black-box setting. Adv-Makeup develops a task-driven makeup generation method with a blending module to synthesize imperceptible eye shadow over the orbital region of faces. To achieve transferability, Adv-Makeup implements a fine-grained meta-learning based adversarial attack strategy to learn more vulnerable or sensitive features from various models. Compared to existing techniques, extensive visualization results demonstrate that Adv-Makeup is capable of generating much more imperceptible attacks under both digital and physical scenarios. Meanwhile, extensive quantitative experiments show that Adv-Makeup can significantly improve the attack success rate under the black-box setting, even when attacking commercial systems.

IJCAI Conference 2021 Conference Paper

Dual Reweighting Domain Generalization for Face Presentation Attack Detection

  • Shubao Liu
  • Ke-Yue Zhang
  • Taiping Yao
  • Kekai Sheng
  • Shouhong Ding
  • Ying Tai
  • Jilin Li
  • Yuan Xie

Face anti-spoofing approaches based on domain generalization (DG) have drawn growing attention due to their robustness in unseen scenarios. Previous methods treat each sample from multiple domains indiscriminately during the training process and endeavor to extract a common feature space to improve generalization. However, due to complex and biased data distributions, treating all samples equally will corrupt the generalization ability. To address this issue, we propose a novel Dual Reweighting Domain Generalization (DRDG) framework which iteratively reweights the relative importance between samples to further improve generalization. Concretely, a Sample Reweighting Module is first proposed to identify samples with relatively large domain bias and reduce their impact on the overall optimization. Afterwards, a Feature Reweighting Module is introduced to focus on these samples and extract more domain-irrelevant features via a self-distilling mechanism. Combined with the domain discriminator, the iteration of the two modules promotes the extraction of generalized features. Extensive experiments and visualizations are presented to demonstrate the effectiveness and interpretability of our method against the state-of-the-art competitors.

AAAI Conference 2021 Conference Paper

Generalizable Representation Learning for Mixture Domain Face Anti-Spoofing

  • Zhihong Chen
  • Taiping Yao
  • Kekai Sheng
  • Shouhong Ding
  • Ying Tai
  • Jilin Li
  • Feiyue Huang
  • Xinyu Jin

The face anti-spoofing approach based on domain generalization (DG) has drawn growing attention due to its robustness in unseen scenarios. Existing DG methods assume that the domain label is known. However, in real-world applications the collected dataset always contains mixture domains, where the domain label is unknown. In this case, most existing methods may not work. Further, even if we could obtain the domain label as existing methods do, we argue that this is only a sub-optimal partition. To overcome this limitation, we propose domain dynamic adjustment meta-learning (D²AM), which requires no domain labels: it iteratively divides mixture domains via discriminative domain representations and trains a generalizable face anti-spoofing model with meta-learning. Specifically, we design a domain feature based on Instance Normalization (IN) and propose a domain representation learning module (DRLM) to extract discriminative domain features for clustering. Moreover, to reduce the side effect of outliers on clustering performance, we additionally utilize maximum mean discrepancy (MMD) to align the distribution of sample features to a prior distribution, which improves the reliability of clustering. Extensive experiments show that the proposed method outperforms conventional DG-based face anti-spoofing methods, including those utilizing domain labels. Furthermore, we enhance interpretability through visualization.

AAAI Conference 2021 Conference Paper

Local Relation Learning for Face Forgery Detection

  • Shen Chen
  • Taiping Yao
  • Yang Chen
  • Shouhong Ding
  • Jilin Li
  • Rongrong Ji

With the rapid development of facial manipulation techniques, face forgery detection has received considerable attention in digital media forensics due to security concerns. Most existing methods formulate face forgery detection as a classification problem and utilize binary labels or manipulated region masks as supervision. However, without considering the correlation between local regions, these global supervisions are insufficient for learning a generalized feature and are prone to overfitting. To address this issue, we propose a novel perspective on face forgery detection via local relation learning. Specifically, we propose a Multi-scale Patch Similarity Module (MPSM), which measures the similarity between features of local regions and forms a robust and generalized similarity pattern. Moreover, we propose an RGB-Frequency Attention Module (RFAM) to fuse information in both the RGB and frequency domains for a more comprehensive local feature representation, which further improves the reliability of the similarity pattern. Extensive experiments show that the proposed method consistently outperforms the state-of-the-art methods on widely used benchmarks. Furthermore, detailed visualizations show the robustness and interpretability of our method.
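
To illustrate the similarity-pattern idea, a simplified single-scale sketch follows; the patch size, pooling, and function name are assumptions for illustration (the paper's MPSM is multi-scale and paired with frequency attention, neither of which is reproduced here).

```python
# Hypothetical single-scale patch similarity pattern: pool the backbone
# feature map into patch descriptors and compute their pairwise cosine
# similarities as a relation map.
import torch
import torch.nn.functional as F

def patch_similarity_pattern(feature_map: torch.Tensor, patch: int = 2) -> torch.Tensor:
    # feature_map: (batch, channels, H, W) from a CNN backbone
    pooled = F.avg_pool2d(feature_map, kernel_size=patch)            # one descriptor per patch
    b, c, h, w = pooled.shape
    patches = F.normalize(pooled.flatten(2).transpose(1, 2), dim=2)  # (b, h*w, c)
    return patches @ patches.transpose(1, 2)                         # (b, h*w, h*w)

fmap = torch.randn(1, 64, 8, 8)
print(patch_similarity_pattern(fmap).shape)   # torch.Size([1, 16, 16])
```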