Author name cluster

Jilin Li

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

14 papers

1 author row

AAAI Conference 2022 Conference Paper

Delving into the Local: Dynamic Inconsistency Learning for DeepFake Video Detection

Zhihao Gu
Yang Chen
Taiping Yao
Shouhong Ding
Jilin Li
Lizhuang Ma

The rapid development of facial manipulation techniques has aroused public concerns in recent years. Existing deepfake video detection approaches attempt to capture the discriminative features between real and fake faces based on temporal modelling. However, these works impose supervisions on sparsely sampled video frames but overlook the local motions among adjacent frames, which instead encode rich inconsistency information that can serve as an efficient indicator for DeepFake video detection. To mitigate this issue, we delve into the local motion and propose a novel sampling unit named snippet, which contains a few successive video frames for local temporal inconsistency learning. Moreover, we elaborately design an Intra-Snippet Inconsistency Module (Intra-SIM) and an Inter-Snippet Interaction Module (Inter-SIM) to establish a dynamic inconsistency modelling framework. Specifically, the Intra-SIM applies bi-directional temporal difference operations and a learnable convolution kernel to mine the short-term motions within each snippet. The Inter-SIM is then devised to promote the cross-snippet information interaction to form global representations. The Intra-SIM and Inter-SIM work in an alternate manner and can be plugged into existing 2D CNNs. Our method outperforms the state-of-the-art competitors on four popular benchmark datasets, i.e., FaceForensics++, Celeb-DF, DFDC and Wild-Deepfake. Besides, extensive experiments and visualizations are also presented to further illustrate its effectiveness.
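
The bi-directional temporal difference mentioned in the abstract can be illustrated with a minimal sketch. It assumes a snippet is a list of frames, each flattened to a list of floats; the function name `temporal_differences` is hypothetical, and the learnable convolution kernel of the Intra-SIM is not modelled here, so this is not the authors' implementation.

```python
def temporal_differences(snippet):
    """Return (forward, backward) differences between adjacent frames.

    snippet: list of frames, each a flat list of floats of equal length
    (an assumed representation chosen for illustration).
    """
    pairs = list(zip(snippet, snippet[1:]))
    # Forward direction: frame t+1 minus frame t.
    forward = [[b - a for a, b in zip(cur, nxt)] for cur, nxt in pairs]
    # Backward direction: frame t minus frame t+1.
    backward = [[a - b for a, b in zip(cur, nxt)] for cur, nxt in pairs]
    return forward, backward


forward, backward = temporal_differences([[0.0, 0.0], [1.0, 2.0], [3.0, 2.0]])
```

A snippet of n frames yields n - 1 difference maps per direction; a real module would apply these differences to feature maps rather than raw pixel lists.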

AAAI Conference 2022 Conference Paper

Dual Contrastive Learning for General Face Forgery Detection

Ke Sun
Taiping Yao
Shen Chen
Shouhong Ding
Jilin Li
Rongrong Ji

With various facial manipulation techniques arising, face forgery detection has drawn growing attention due to security concerns. Previous works always formulate face forgery detection as a classification problem based on cross-entropy loss, which emphasizes category-level differences rather than the essential discrepancies between real and fake faces, limiting model generalization in unseen domains. To address this issue, we propose a novel face forgery detection framework, named Dual Contrastive Learning (DCL), which specially constructs positive and negative paired data and performs designed contrastive learning at different granularities to learn generalized feature representation. Concretely, combined with the hard sample selection strategy, Inter-Instance Contrastive Learning (Inter-ICL) is first proposed to promote task-related discriminative feature learning by specially constructing instance pairs. Moreover, to further explore the essential discrepancies, Intra-Instance Contrastive Learning (Intra-ICL) is introduced to focus on the local content inconsistencies prevalent in the forged faces by constructing local-region pairs inside instances. Extensive experiments and visualizations on several datasets demonstrate the generalization of our method against the state-of-the-art competitors. Our code is available at https://github.com/Tencent/TFace.git.

IJCAI Conference 2021 Conference Paper

Adv-Makeup: A New Imperceptible and Transferable Attack on Face Recognition

Bangjie Yin
Wenxuan Wang
Taiping Yao
Junfeng Guo
Zelun Kong
Shouhong Ding
Jilin Li
Cong Liu

Deep neural networks, particularly face recognition models, have been shown to be vulnerable to both digital and physical adversarial examples. However, existing adversarial examples against face recognition systems either lack transferability to black-box models, or fail to be implemented in practice. In this paper, we propose a unified adversarial face generation method, Adv-Makeup, which can realize imperceptible and transferable attacks under the black-box setting. Adv-Makeup develops a task-driven makeup generation method with a blending module to synthesize imperceptible eye shadow over the orbital region on faces. To achieve transferability, Adv-Makeup implements a fine-grained meta-learning based adversarial attack strategy to learn more vulnerable or sensitive features from various models. Compared to existing techniques, sufficient visualization results demonstrate that Adv-Makeup is capable of generating much more imperceptible attacks under both digital and physical scenarios. Meanwhile, extensive quantitative experiments show that Adv-Makeup can significantly improve the attack success rate under the black-box setting, even when attacking commercial systems.

IJCAI Conference 2021 Conference Paper

Dual Reweighting Domain Generalization for Face Presentation Attack Detection

Shubao Liu
Ke-Yue Zhang
Taiping Yao
Kekai Sheng
Shouhong Ding
Ying Tai
Jilin Li
Yuan Xie

Face anti-spoofing approaches based on domain generalization (DG) have drawn growing attention due to their robustness for unseen scenarios. Previous methods treat each sample from multiple domains indiscriminately during the training process, and endeavor to extract a common feature space to improve the generalization. However, due to complex and biased data distribution, directly treating them equally will corrupt the generalization ability. To settle the issue, we propose a novel Dual Reweighting Domain Generalization (DRDG) framework which iteratively reweights the relative importance between samples to further improve the generalization. Concretely, a Sample Reweighting Module is first proposed to identify samples with relatively large domain bias, and reduce their impact on the overall optimization. Afterwards, a Feature Reweighting Module is introduced to focus on these samples and extract more domain-irrelevant features via a self-distilling mechanism. Combined with the domain discriminator, the iteration of the two modules promotes the extraction of generalized features. Extensive experiments and visualizations are presented to demonstrate the effectiveness and interpretability of our method against the state-of-the-art competitors.

AAAI Conference 2021 Conference Paper

Frequency Consistent Adaptation for Real World Super Resolution

Xiaozhong Ji
Guangpin Tao
Yun Cao
Ying Tai
Tong Lu
Chengjie Wang
Jilin Li
Feiyue Huang

Recent deep-learning based Super-Resolution (SR) methods have achieved remarkable performance on images with known degradation. However, these methods always fail in real-world scenes, since the Low-Resolution (LR) images after the ideal degradation (e.g., bicubic down-sampling) deviate from the real source domain. The domain gap between the LR images and the real-world images can be observed clearly on frequency density, which inspires us to explicitly narrow the undesired gap caused by incorrect degradation. From this point of view, we design a novel Frequency Consistent Adaptation (FCA) that ensures the frequency domain consistency when applying existing SR methods to the real scene. We estimate degradation kernels from unsupervised images and generate the corresponding LR images. To provide useful gradient information for kernel estimation, we propose Frequency Density Comparator (FDC) by distinguishing the frequency density of images on different scales. Based on the domain-consistent LR-HR pairs, we train easily implemented Convolutional Neural Network (CNN) SR models. Extensive experiments show that the proposed FCA improves the performance of the SR model under the real-world setting, achieving state-of-the-art results with high fidelity and plausible perception, thus providing a novel effective framework for real-world SR application.

AAAI Conference 2021 Conference Paper

Generalizable Representation Learning for Mixture Domain Face Anti-Spoofing

Zhihong Chen
Taiping Yao
Kekai Sheng
Shouhong Ding
Ying Tai
Jilin Li
Feiyue Huang
Xinyu Jin

Face anti-spoofing approach based on domain generalization (DG) has drawn growing attention due to its robustness for unseen scenarios. Existing DG methods assume that the domain label is known. However, in real-world applications, the collected dataset always contains mixture domains, where the domain label is unknown. In this case, most existing methods may not work. Further, even if we can obtain the domain label as existing methods do, we think this is just a sub-optimal partition. To overcome the limitation, we propose domain dynamic adjustment meta-learning (D2AM) without using domain labels, which iteratively divides mixture domains via discriminative domain representation and trains a generalizable face anti-spoofing with meta-learning. Specifically, we design a domain feature based on Instance Normalization (IN) and propose a domain representation learning module (DRLM) to extract discriminative domain features for clustering. Moreover, to reduce the side effect of outliers on clustering performance, we additionally utilize maximum mean discrepancy (MMD) to align the distribution of sample features to a prior distribution, which improves the reliability of clustering. Extensive experiments show that the proposed method outperforms conventional DG-based face anti-spoofing methods, including those utilizing domain labels. Furthermore, we enhance the interpretability through visualization.
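
For readers unfamiliar with maximum mean discrepancy, the following is a minimal sketch of the standard biased estimator of squared MMD between two samples. The RBF kernel, the `bandwidth` parameter, and the function names are assumptions made for illustration; the abstract does not state which kernel or estimator the authors use.

```python
import math


def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD between samples xs and ys."""
    def mean_kernel(a_set, b_set):
        total = sum(rbf_kernel(a, b, bandwidth) for a in a_set for b in b_set)
        return total / (len(a_set) * len(b_set))

    # E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)]
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


features = [[0.0], [1.0]]   # stand-in for sample features
prior = [[5.0], [6.0]]      # stand-in for draws from a prior distribution
gap = mmd_squared(features, prior)
```

The estimate is zero when both samples are identical and grows as the two sets drift apart, which is why minimizing it pulls the feature distribution toward the prior.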

IJCAI Conference 2021 Conference Paper

HifiFace: 3D Shape and Semantic Prior Guided High Fidelity Face Swapping

Yuhan Wang
Xu Chen
Junwei Zhu
Wenqing Chu
Ying Tai
Chengjie Wang
Jilin Li
Yongjian Wu

In this work, we propose a high fidelity face swapping method, called HifiFace, which can well preserve the face shape of the source face and generate photo-realistic results. Unlike other existing face swapping works that only use face recognition model to keep the identity similarity, we propose 3D shape-aware identity to control the face shape with the geometric supervision from 3DMM and 3D face reconstruction method. Meanwhile, we introduce the Semantic Facial Fusion module to optimize the combination of encoder and decoder features and make adaptive blending, which makes the results more photo-realistic. Extensive experiments on faces in the wild demonstrate that our method can preserve better identity, especially on the face shape, and can generate more photo-realistic results than previous state-of-the-art methods. Code is available at: https://johann.wang/HifiFace

AAAI Conference 2021 Conference Paper

Learning a Few-shot Embedding Model with Contrastive Learning

Chen Liu
Yanwei Fu
Chengming Xu
Siqian Yang
Jilin Li
Chengjie Wang
Li Zhang

Few-shot learning (FSL) aims to recognize target classes by adapting the prior knowledge learned from source classes. Such knowledge usually resides in a deep embedding model for a general matching purpose of the support and query image pairs. The objective of this paper is to repurpose the contrastive learning for such matching to learn a few-shot embedding model. We make the following contributions: (i) We investigate the contrastive learning with Noise Contrastive Estimation (NCE) in a supervised manner for training a few-shot embedding model; (ii) We propose a novel contrastive training scheme dubbed infoPatch, exploiting the patch-wise relationship to substantially improve the popular infoNCE; (iii) We show that the embedding learned by the proposed infoPatch is more effective; (iv) Our model is thoroughly evaluated on few-shot recognition tasks and demonstrates state-of-the-art results on miniImageNet and appealing performance on tieredImageNet and Fewshot-CIFAR100 (FC-100).
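
As background for the infoNCE objective that the abstract builds on, here is a minimal sketch of the generic loss for one query with one positive and several negatives. The cosine similarity, the `temperature` value, and the function name `info_nce` are illustrative assumptions; the patch-wise infoPatch scheme itself is not reproduced.

```python
import math


def info_nce(query, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss for a single query embedding."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    def normalize(v):
        norm = math.sqrt(dot(v, v))
        return [a / norm for a in v]

    q = normalize(query)
    # Index 0 holds the positive; the remaining entries are negatives.
    logits = [dot(q, normalize(k)) / temperature for k in [positive] + negatives]
    # Numerically stable log-sum-exp over all candidates.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
    # Cross-entropy with the positive as the target class.
    return log_sum - logits[0]


loss = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
```

The loss approaches zero when the query is far more similar to the positive than to any negative, and rises when a negative is the closest match.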

AAAI Conference 2021 Conference Paper

Learning Comprehensive Motion Representation for Action Recognition

Mingyu Wu
Boyuan Jiang
Donghao Luo
Junchi Yan
Yabiao Wang
Ying Tai
Chengjie Wang
Jilin Li

For action recognition learning, 2D CNN-based methods are efficient but may yield redundant features due to applying the same 2D convolution kernel to each frame. Recent efforts attempt to capture motion information by establishing inter-frame connections while still suffering from a limited temporal receptive field or high latency. Moreover, the feature enhancement is often only performed by channel or space dimension in action recognition. To address these issues, we first devise a Channel-wise Motion Enhancement (CME) module to adaptively emphasize the channels related to dynamic information with a channel-wise gate vector. The channel gates generated by CME incorporate the information from all the other frames in the video. We further propose a Spatial-wise Motion Enhancement (SME) module to focus on the regions with the critical target in motion, according to the point-to-point similarity between adjacent feature maps. The intuition is that the change of background is typically slower than the motion area. Both CME and SME have clear physical meaning in capturing action clues. By integrating the two modules into the off-the-shelf 2D network, we finally obtain a Comprehensive Motion Representation (CMR) learning method for action recognition, which achieves competitive performance on Something-Something V1 & V2 and Kinetics-400. On the temporal reasoning datasets Something-Something V1 and V2, our method outperforms the current state-of-the-art by 2.3% and 1.9% when using 16 frames as input, respectively.

AAAI Conference 2021 Conference Paper

Local Relation Learning for Face Forgery Detection

Shen Chen
Taiping Yao
Yang Chen
Shouhong Ding
Jilin Li
Rongrong Ji

With the rapid development of facial manipulation techniques, face forgery detection has received considerable attention in digital media forensics due to security concerns. Most existing methods formulate face forgery detection as a classification problem and utilize binary labels or manipulated region masks as supervision. However, without considering the correlation between local regions, these global supervisions are insufficient to learn a generalized feature and prone to overfitting. To address this issue, we propose a novel perspective of face forgery detection via local relation learning. Specifically, we propose a Multi-scale Patch Similarity Module (MPSM), which measures the similarity between features of local regions and forms a robust and generalized similarity pattern. Moreover, we propose an RGB-Frequency Attention Module (RFAM) to fuse information in both RGB and frequency domains for more comprehensive local feature representation, which further improves the reliability of the similarity pattern. Extensive experiments show that the proposed method consistently outperforms the state-of-the-art methods on widely-used benchmarks. Furthermore, detailed visualization shows the robustness and interpretability of our method.

AAAI Conference 2021 Conference Paper

To Choose or to Fuse? Scale Selection for Crowd Counting

Qingyu Song
Changan Wang
Yabiao Wang
Ying Tai
Chengjie Wang
Jilin Li
Jian Wu
Jiayi Ma

In this paper, we address the large scale variation problem in crowd counting by taking full advantage of the multiscale feature representations in a multi-level network. We implement such an idea by keeping the counting error of a patch as small as possible with a proper feature level selection strategy, since a specific feature level tends to perform better for a certain range of scales. However, without scale annotations, it is sub-optimal and error-prone to manually assign the predictions for heads of different scales to specific feature levels. Therefore, we propose a Scale-Adaptive Selection Network (SASNet), which automatically learns the internal correspondence between the scales and the feature levels. Instead of directly using the predictions from the most appropriate feature level as the final estimation, our SASNet also considers the predictions from other feature levels via weighted average, which helps to mitigate the gap between discrete feature levels and continuous scale variation. Since the heads in a local patch share roughly the same scale, we conduct the adaptive selection strategy in a patch-wise style. However, pixels within a patch contribute different counting errors due to the varying difficulty of learning. Thus, we further propose a Pyramid Region Awareness Loss (PRA Loss) to recursively select the hardest sub-regions within a patch until reaching the pixel level. With awareness of whether the parent patch is over-estimated or under-estimated, the fine-grained optimization with the PRA Loss for these region-aware hard pixels helps to alleviate the inconsistency problem between training target and evaluation metric. The state-of-the-art results on four datasets demonstrate the superiority of our approach. The code will be available at: https://github.com/TencentYoutuResearch/CrowdCounting-SASNet.

AAAI Conference 2020 Conference Paper

Fast Learning of Temporal Action Proposal via Dense Boundary Generator

Chuming Lin
Jian Li
Yabiao Wang
Ying Tai
Donghao Luo
Zhipeng Cui
Chengjie Wang
Jilin Li

Generating temporal action proposals remains a very challenging problem, where the main issue lies in predicting precise temporal proposal boundaries and reliable action confidence in long and untrimmed real-world videos. In this paper, we propose an efficient and unified framework to generate temporal action proposals named Dense Boundary Generator (DBG), which draws inspiration from boundary-sensitive methods and implements boundary classification and action completeness regression for densely distributed proposals. In particular, the DBG consists of two modules: Temporal boundary classification (TBC) and Action-aware completeness regression (ACR). The TBC aims to provide two temporal boundary confidence maps by low-level two-stream features, while the ACR is designed to generate an action completeness score map by high-level action-aware features. Moreover, we introduce a dual stream BaseNet (DSB) to encode RGB and optical flow information, which helps to capture discriminative boundary and actionness features. Extensive experiments on popular benchmarks ActivityNet-1.3 and THUMOS14 demonstrate the superiority of DBG over the state-of-the-art proposal generator (e.g., MGG and BMN).

AAAI Conference 2020 Conference Paper

TEINet: Towards an Efficient Architecture for Video Recognition

Zhaoyang Liu
Donghao Luo
Yabiao Wang
Limin Wang
Ying Tai
Chengjie Wang
Jilin Li
Feiyue Huang

Efficiency is an important issue in designing video architectures for action recognition. 3D CNNs have witnessed remarkable progress in action recognition from videos. However, compared with their 2D counterparts, 3D convolutions often introduce a large amount of parameters and cause high computational cost. To relieve this problem, we propose an efficient temporal module, termed Temporal Enhancement-and-Interaction (TEI Module), which could be plugged into the existing 2D CNNs (denoted by TEINet). The TEI module presents a different paradigm to learn temporal features by decoupling the modeling of channel correlation and temporal interaction. First, it contains a Motion Enhanced Module (MEM) which enhances the motion-related features while suppressing irrelevant information (e.g., background). Then, it introduces a Temporal Interaction Module (TIM) which supplements the temporal contextual information in a channel-wise manner. This two-stage modeling scheme is not only able to capture temporal structure flexibly and effectively, but also efficient for model inference. We conduct extensive experiments to verify the effectiveness of TEINet on several benchmarks (e.g., Something-Something V1&V2, Kinetics, UCF101 and HMDB51). Our proposed TEINet can achieve a good recognition accuracy on these datasets while still preserving high efficiency.

AAAI Conference 2019 Conference Paper

Towards Highly Accurate and Stable Face Alignment for High-Resolution Videos

Ying Tai
Yicong Liang
Xiaoming Liu
Lei Duan
Jilin Li
Chengjie Wang
Feiyue Huang
Yu Chen

In recent years, heatmap regression based models have shown their effectiveness in face alignment and pose estimation. However, Conventional Heatmap Regression (CHR) is neither accurate nor stable when dealing with high-resolution facial videos, since it finds the maximum activated location in heatmaps which are generated from rounding coordinates, and thus leads to quantization errors when scaling back to the original high-resolution space. In this paper, we propose a Fractional Heatmap Regression (FHR) for high-resolution video-based face alignment. The proposed FHR can accurately estimate the fractional part according to the 2D Gaussian function by sampling three points in heatmaps. To further stabilize the landmarks among continuous video frames while maintaining precision at the same time, we propose a novel stabilization loss that contains two terms to address time delay and non-smooth issues, respectively. Experiments on 300W, 300-VW and Talking Face datasets clearly demonstrate that the proposed method is more accurate and stable than the state-of-the-art models.
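
The idea of recovering a fractional peak from three heatmap samples can be sketched as follows. Assuming a noise-free isotropic 2D Gaussian heatmap with known standard deviation `sigma`, taking the log-ratio of neighbouring samples cancels the amplitude and leaves a linear equation in each centre coordinate. The function names and the choice of sample points (x, y), (x+1, y), (x, y+1) are assumptions made for this illustration, not necessarily the formulation used in the paper.

```python
import math


def gaussian_value(x, y, cx, cy, sigma):
    """Value at (x, y) of a unit-amplitude Gaussian centred at (cx, cy)."""
    return math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))


def fractional_peak(h00, h10, h01, x, y, sigma):
    """Recover the Gaussian centre from three heatmap samples.

    h00 = H(x, y), h10 = H(x + 1, y), h01 = H(x, y + 1); all must be > 0.
    From ln H(x+1, y) - ln H(x, y) = (2 (cx - x) - 1) / (2 sigma^2):
    """
    cx = x + 0.5 + sigma ** 2 * (math.log(h10) - math.log(h00))
    cy = y + 0.5 + sigma ** 2 * (math.log(h01) - math.log(h00))
    return cx, cy


# Synthetic check: true centre (10.3, 7.8), integer argmax at (10, 8).
sigma = 1.5
samples = [gaussian_value(px, py, 10.3, 7.8, sigma)
           for px, py in [(10, 8), (11, 8), (10, 9)]]
cx, cy = fractional_peak(*samples, x=10, y=8, sigma=sigma)
```

On real predicted heatmaps the samples are noisy and the Gaussian form only approximate, so the recovered centre is an estimate rather than an exact value.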
