Arrow Research search

Author name cluster

Zhen Lei

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

26 papers
2 author rows

Possible papers

26

AAAI Conference 2026 Conference Paper

Unifying Locality of KANs and Feature Drift Compensation Projection for Data-Free Replay Based Continual Face Forgery Detection

  • Tianshuo Zhang
  • Siran Peng
  • Li Gao
  • Haoyuan Zhang
  • Xiangyu Zhu
  • Zhen Lei

The rapid advancements in face forgery techniques necessitate that detectors continuously adapt to new forgery methods, thus situating face forgery detection within a continual learning paradigm. However, when detectors learn new forgery types, their performance on previous types often degrades rapidly, a phenomenon known as catastrophic forgetting. Kolmogorov-Arnold Networks (KANs) utilize locally plastic splines as their activation functions, enabling them to learn new tasks by modifying only local regions of the functions while leaving other areas unaffected. Therefore, they are naturally suitable for addressing catastrophic forgetting. However, KANs have two significant limitations: 1) the splines are ineffective for modeling high-dimensional images, while alternative activation functions that are suitable for images lack the essential property of locality; 2) in continual learning, when features from different domains overlap, the mapping of different domains to distinct curve regions always collapses due to repeated modifications of the same regions. In this paper, we propose a KAN-based Continual Face Forgery Detection (KAN-CFD) framework, which includes a Domain-Group KAN Detector (DG-KD) and a data-free replay Feature Separation strategy via KAN Drift Compensation Projection (FS-KDCP). DG-KD enables KANs to fit high-dimensional image inputs while preserving locality and local plasticity. FS-KDCP avoids the overlap of the KAN input spaces without using data from prior tasks. Experimental results demonstrate that the proposed method achieves superior performance while notably reducing forgetting.
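
The locality property referenced above can be seen with any compactly supported spline basis: editing the curve in one input region leaves the rest of the curve untouched. The snippet below is a generic piecewise-linear (hat-basis) illustration of that behaviour, not the paper's DG-KD module; the grid, learning rate, and toy targets are assumptions made only for the demo.

```python
import numpy as np

def hat_basis(x, grid):
    """Piecewise-linear (hat) basis functions on a fixed grid.
    Returns shape (len(x), len(grid)); each basis function is nonzero
    only between its two neighbouring knots, which is the locality."""
    x = np.asarray(x)[:, None]            # (N, 1)
    g = np.asarray(grid)[None, :]         # (1, K)
    h = grid[1] - grid[0]                 # uniform knot spacing assumed
    return np.clip(1.0 - np.abs(x - g) / h, 0.0, None)

grid = np.linspace(-3, 3, 13)             # 13 knots on [-3, 3]
coef = np.zeros_like(grid)                # spline coefficients

def spline(x):
    return hat_basis(x, grid) @ coef

# "Learn" new behaviour only for inputs near x = 2: gradients w.r.t. coef
# are the basis values, which vanish away from x = 2, so only the few
# nearby coefficients change.
x_new = np.array([1.9, 2.0, 2.1])
target = np.ones_like(x_new)
for _ in range(200):
    err = spline(x_new) - target
    grad = hat_basis(x_new, grid).T @ err / len(x_new)
    coef -= 0.5 * grad

print(np.round(coef, 2))                       # nonzero only around x = 2
print(spline(np.array([-2.0, 0.0, 2.0])))      # outputs at -2 and 0 stay ~0
```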

NeurIPS Conference 2025 Conference Paper

DevFD: Developmental Face Forgery Detection by Learning Shared and Orthogonal LoRA Subspaces

  • Tianshuo Zhang
  • Li Gao
  • Siran Peng
  • Xiangyu Zhu
  • Zhen Lei

The rise of realistic digital face generation and manipulation poses significant social risks. The primary challenge lies in the rapid and diverse evolution of generation techniques, which often outstrip the detection capabilities of existing models. To defend against the ever-evolving new types of forgery, we need to enable our model to quickly adapt to new domains with limited computation and data while avoiding forgetting previously learned forgery types. In this work, we posit that genuine facial samples are abundant and relatively stable in acquisition methods, while forgery faces continuously evolve with the iteration of manipulation techniques. Given the practical infeasibility of exhaustively collecting all forgery variants, we frame face forgery detection as a continual learning problem and allow the model to develop as new forgery types emerge. Specifically, we employ a Developmental Mixture of Experts (MoE) architecture that uses LoRA models as its individual experts. These experts are organized into two groups: a Real-LoRA to learn and refine knowledge of real faces, and multiple Fake-LoRAs to capture incremental information from different forgery types. To prevent catastrophic forgetting, we ensure that the learning direction of Fake-LoRAs is orthogonal to the established subspace. Moreover, we integrate orthogonal gradients into the orthogonal loss of Fake-LoRAs, preventing gradient interference throughout the training process of each task. Experimental results under both the dataset-incremental and manipulation-type-incremental protocols demonstrate the effectiveness of our method.
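
One common way to keep a new low-rank update out of previously learned subspaces is to penalize the overlap between the new LoRA factors and the frozen ones. The sketch below assumes the A matrices carry the relevant directions and uses a simple cross-Gram Frobenius penalty; it is not the paper's exact orthogonal loss or its gradient-projection variant.

```python
import torch

def lora_orthogonality_loss(A_new, A_prev_list):
    """Penalize overlap between the row space of a new LoRA A matrix
    (shape r x d) and the row spaces of previously learned, frozen ones."""
    loss = A_new.new_zeros(())
    for A_prev in A_prev_list:
        # Frobenius norm of the cross-Gram matrix; zero iff row spaces are orthogonal.
        loss = loss + (A_new @ A_prev.detach().t()).pow(2).sum()
    return loss

d, r = 768, 8
A_new = torch.randn(r, d, requires_grad=True)
A_prev_list = [torch.randn(r, d) for _ in range(3)]   # frozen experts from earlier tasks

task_loss = torch.tensor(0.0)                         # placeholder for the detection loss
total = task_loss + 0.1 * lora_orthogonality_loss(A_new, A_prev_list)
total.backward()
```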

ECAI Conference 2025 Conference Paper

DHLight: Multi-Agent Policy-Based Directed Hypergraph Learning for Traffic Signal Control

  • Zhen Lei
  • Zhishu Shen
  • Kang Wang
  • Zhenwei Wang
  • Tiehua Zhang

Recent advancements in Deep Reinforcement Learning (DRL) and Graph Neural Networks (GNNs) have demonstrated notable promise in the realm of intelligent traffic signal control, facilitating coordination across multiple intersections. However, traditional methods that rely on standard graph structures often fail to capture the intricate higher-order spatio-temporal correlations inherent in real-world traffic dynamics. Standard graphs cannot fully represent the spatial relationships within road networks, which limits the effectiveness of graph-based approaches. In contrast, directed hypergraphs provide a more accurate representation of spatial information by modeling complex directed relationships among multiple nodes. In this paper, we propose DHLight, a novel multi-agent policy-based framework that synergistically integrates a directed hypergraph learning module. This framework introduces a novel dynamic directed hypergraph construction mechanism, which captures complex and evolving spatio-temporal relationships among intersections in road networks. By leveraging the directed hypergraph relational structure, DHLight empowers agents to achieve adaptive decision-making in traffic signal control. The effectiveness of DHLight is validated against state-of-the-art baselines through extensive experiments on various network datasets. We release the code to support the reproducibility of this work at https://github.com/LuckyVoasem/Traffic-Light-control

AAAI Conference 2025 Conference Paper

FIRM: Flexible Interactive Reflection ReMoval

  • Xiao Chen
  • Xudong Jiang
  • Yunkang Tao
  • Zhen Lei
  • Qing Li
  • Chenyang Lei
  • Zhaoxiang Zhang

Removing reflection from a single image is challenging due to the absence of general reflection priors. Although existing methods incorporate extensive user guidance for satisfactory performance, they often lack the flexibility to adapt user guidance in different modalities, and dense user interactions further limit their practicality. To alleviate these problems, this paper presents FIRM, a novel framework for Flexible Interactive image Reflection reMoval with various forms of guidance, where users can provide sparse visual guidance (e.g., points, boxes, or strokes) or text descriptions for better reflection removal. Firstly, we design a novel user guidance conversion module (UGC) to transform different forms of guidance into unified contrastive masks. The contrastive masks provide explicit cues for identifying reflection and transmission layers in blended images. Secondly, we devise a contrastive mask-guided reflection removal network that comprises a newly proposed contrastive guidance interaction block (CGIB). This block leverages a unique cross-attention mechanism that merges contrastive masks with image features, allowing for precise layer separation. The proposed framework requires only 10% of the guidance time needed by previous interactive methods, marking a step change in flexibility. Extensive results on public real-world reflection removal datasets confirm that our method achieves state-of-the-art reflection removal performance.
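
A cross-attention block in which image tokens attend to tokens derived from a guidance mask is one plausible way to merge a contrastive mask with image features. The toy module below sketches only that pattern; the real CGIB design, dimensions, and normalization are not taken from the paper.

```python
import torch
import torch.nn as nn

class MaskGuidedCrossAttention(nn.Module):
    """Toy cross-attention injecting a single-channel guidance mask into
    image features: image tokens attend to mask tokens. A rough sketch of
    the general idea only, not a reproduction of CGIB."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.mask_embed = nn.Conv2d(1, dim, kernel_size=1)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feat, mask):
        # feat: (B, C, H, W) image features, mask: (B, 1, H, W) contrastive mask
        B, C, H, W = feat.shape
        q = feat.flatten(2).transpose(1, 2)                   # (B, H*W, C)
        kv = self.mask_embed(mask).flatten(2).transpose(1, 2)
        out, _ = self.attn(q, kv, kv)                         # queries from image, keys/values from mask
        return self.norm(q + out).transpose(1, 2).reshape(B, C, H, W)

block = MaskGuidedCrossAttention()
feat = torch.randn(2, 64, 32, 32)
mask = torch.rand(2, 1, 32, 32)
print(block(feat, mask).shape)   # torch.Size([2, 64, 32, 32])
```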

TMLR Journal 2025 Journal Article

MDTree: A Masked Dynamic Autoregressive Model for Phylogenetic Inference

  • Zelin Zang
  • ChenRui Duan
  • Siyuan Li
  • Jinlin Wu
  • Bingo Wing-Kuen Ling
  • Fuji Yang
  • Jiebo Luo
  • Zhen Lei

Phylogenetic tree inference requires optimizing both branch lengths and topologies, yet traditional MCMC-based methods suffer from slow convergence and high computational cost. Recent deep learning approaches improve scalability but remain constrained: Bayesian models are computationally intensive, autoregressive methods depend on fixed species orders, and flow-based models underutilize genomic signals. Fixed-order autoregression introduces an inductive bias misaligned with evolutionary proximity: early misplacements distort subsequent attachment probabilities and compound topology errors (exposure bias). Absent sequence-informed priors, the posterior over the super-exponential topology space remains diffuse and multimodal, yielding high-variance gradients and sluggish convergence for both MCMC proposals and neural samplers. We propose MDTree, a masked dynamic autoregressive framework that integrates genomic priors into a Dynamic Ordering Network to learn biologically informed node sequences. A dynamic masking mechanism further enables parallel node insertion, improving efficiency without sacrificing accuracy. Experiments on standard benchmarks demonstrate that MDTree outperforms existing methods in accuracy and runtime while producing biologically coherent phylogenies, providing a scalable solution for large-scale evolutionary analysis.

AAAI Conference 2025 Conference Paper

Mixture-of-Attack-Experts with Class Regularization for Unified Physical-Digital Face Attack Detection

  • Shunxin Chen
  • Ajian Liu
  • Junze Zheng
  • Jun Wan
  • Kailai Peng
  • Sergio Escalera
  • Zhen Lei

Unified detection of digital and physical attacks in facial recognition systems has become a focal point of research in recent years. However, current multi-modal methods typically ignore the intra-class and inter-class variability across different types of attacks, leading to degraded performance. To address this limitation, we propose MoAE-CR, a framework that effectively leverages class-aware information for improved attack detection. Our improvements manifest at two levels, i.e., the feature and loss level. At the feature level, we propose Mixture-of-Attack-Experts (MoAEs) to capture more subtle differences among various types of fake faces. At the loss level, we introduce Class Regularization (CR) through the Disentanglement Module (DM) and the Cluster Distillation Module (CDM). The DM enhances class separability by increasing the distance between the centers of live and fake face classes. However, center-to-center constraints alone are insufficient to ensure distinctive representations for individual features. Thus, we propose the CDM to further cluster features around their class centers while maintaining separation from other classes. Moreover, specific attacks that significantly deviate from common attack patterns are often overlooked. To address this issue, our distance calculation prioritizes more distant features. Extensive experiments on two unified physical-digital attack datasets demonstrate the state-of-the-art performance of the proposed method.
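
The two loss-level ideas described above, pushing class centers apart and pulling features toward their own center, can be written as simple distance-based terms. The sketch below assumes Euclidean distances and a hinge margin; the paper's exact DM/CDM formulations, including the weighting of more distant features, are not reproduced.

```python
import torch
import torch.nn.functional as F

def center_losses(features, labels, centers, margin=10.0):
    """Toy versions of a separation term (push class centers apart) and a
    clustering term (pull each feature toward its own class center)."""
    # Separation: hinge on pairwise distances between class centers.
    dists = torch.cdist(centers, centers)                        # (K, K)
    off_diag = dists[~torch.eye(len(centers), dtype=torch.bool)]
    separation = F.relu(margin - off_diag).mean()
    # Clustering: distance of each feature to its assigned class center.
    clustering = (features - centers[labels]).pow(2).sum(dim=1).mean()
    return separation, clustering

features = torch.randn(32, 128)
labels = torch.randint(0, 3, (32,))                              # e.g. live, physical, digital
centers = torch.randn(3, 128, requires_grad=True)
sep, clu = center_losses(features, labels, centers)
(sep + clu).backward()
```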

AAAI Conference 2025 Conference Paper

RCTrans: Radar-Camera Transformer via Radar Densifier and Sequential Decoder for 3D Object Detection

  • Yiheng Li
  • Yang Yang
  • Zhen Lei

In radar-camera 3D object detection, the radar point clouds are sparse and noisy, which causes difficulties in fusing camera and radar modalities. To solve this, we introduce a novel query-based detection method named Radar-Camera Transformer (RCTrans). Specifically, we first design a Radar Dense Encoder to enrich the sparse valid radar tokens, and then concatenate them with the image tokens. By doing this, we can fully explore the 3D information of each interest region and reduce the interference of empty tokens during the fusing stage. We then design a Pruning Sequential Decoder to predict 3D boxes based on the obtained tokens and randomly initialized queries. To alleviate the effect of elevation ambiguity in radar point clouds, we gradually locate the position of the object via a sequential fusion structure. It helps to get more precise and flexible correspondences between tokens and queries. A pruning training strategy is adopted in the decoder, which can save much time during inference and inhibit queries from losing their distinctiveness. Extensive experiments on the large-scale nuScenes dataset prove the superiority of our method, and we also achieve new state-of-the-art radar-camera 3D detection results.

AAAI Conference 2025 Conference Paper

RealisHuman: A Two-Stage Approach for Refining Malformed Human Parts in Generated Images

  • Benzhi Wang
  • Jingkai Zhou
  • Jingqi Bai
  • Yang Yang
  • Weihua Chen
  • Fan Wang
  • Zhen Lei

In recent years, diffusion models have revolutionized visual generation, outperforming traditional frameworks like Generative Adversarial Networks (GANs). However, generating images of humans with realistic semantic parts, such as hands and faces, remains a significant challenge due to their intricate structural complexity. To address this issue, we propose a novel post-processing solution named RealisHuman. The RealisHuman framework operates in two stages. First, it generates realistic human parts, such as hands or faces, using the original malformed parts as references, ensuring consistent details with the original image. Second, it seamlessly integrates the rectified human parts back into their corresponding positions by repainting the surrounding areas to ensure smooth and realistic blending. The RealisHuman framework significantly enhances the realism of human generation, as demonstrated by notable improvements in both qualitative and quantitative metrics.

IJCAI Conference 2025 Conference Paper

Top-Down Guidance for Learning Object-Centric Representations

  • Junhong Zou
  • Xiangyu Zhu
  • Zhaoxiang Zhang
  • Zhen Lei

Humans' innate ability to decompose scenes into objects allows for efficient understanding, predicting, and planning. In light of this, Object-Centric Learning (OCL) attempts to endow networks with similar capabilities, learning to represent scenes with the composition of objects. However, existing OCL models only learn through reconstructing the input images, which does not assist the model in distinguishing objects, resulting in suboptimal object-centric representations. This flaw limits current object-centric models to relatively simple downstream tasks. To address this issue, we draw on humans' top-down vision pathway and propose Top-Down Guided Network (TDGNet), which includes a top-down pathway to improve object-centric representations. During training, the top-down pathway constructs guidance with high-level object-centric representations to optimize the low-level grid features output by the backbone, while during inference, it refines object-centric representations by detecting and resolving conflicts between low- and high-level features. We show that TDGNet outperforms current object-centric models on multiple datasets of varying complexity. In addition, we expand the downstream task scope of object-centric representations by applying TDGNet to the field of robotics, validating its effectiveness in downstream tasks including video prediction and visual planning. Code will be available at https://github.com/zoujunhong/RHGNet.

AAAI Conference 2024 Conference Paper

Compositional Inversion for Stable Diffusion Models

  • Xulu Zhang
  • Xiao-Yong Wei
  • Jinlin Wu
  • Tianyi Zhang
  • Zhaoxiang Zhang
  • Zhen Lei
  • Qing Li

Inversion methods, such as Textual Inversion, generate personalized images by incorporating concepts of interest provided by user images. However, existing methods often suffer from overfitting issues, where the dominant presence of inverted concepts leads to the absence of other desired concepts. It stems from the fact that during inversion, the irrelevant semantics in the user images are also encoded, forcing the inverted concepts to occupy locations far from the core distribution in the embedding space. To address this issue, we propose a method that guides the inversion process towards the core distribution for compositional embeddings. Additionally, we introduce a spatial regularization approach to balance the attention on the concepts being composed. Our method is designed as a post-training approach and can be seamlessly integrated with other inversion methods. Experimental results demonstrate the effectiveness of our proposed approach in mitigating the overfitting problem and generating more diverse and balanced compositions of concepts in the synthesized images. The source code is available at https://github.com/zhangxulu1996/Compositional-Inversion.
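
Guiding an inverted token embedding toward the region occupied by ordinary vocabulary embeddings can be approximated with a simple anchor penalty. The sketch below uses the mean of the k nearest vocabulary embeddings as the anchor, which is an illustrative assumption rather than the paper's formulation.

```python
import torch

def core_distribution_penalty(token_embed, vocab_embeds, k=32):
    """Rough sketch of keeping an inverted concept embedding near the region
    occupied by existing token embeddings: penalize its distance to the mean
    of its k nearest vocabulary embeddings."""
    d = torch.cdist(token_embed[None], vocab_embeds)[0]       # (V,)
    nearest = vocab_embeds[d.topk(k, largest=False).indices]  # (k, D)
    return (token_embed - nearest.mean(dim=0)).pow(2).sum()

vocab_embeds = torch.randn(5000, 768)            # toy stand-in for a text-encoder vocabulary
token_embed = torch.randn(768, requires_grad=True)
loss = core_distribution_penalty(token_embed, vocab_embeds)
loss.backward()
```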

AAAI Conference 2024 Conference Paper

Compound Text-Guided Prompt Tuning via Image-Adaptive Cues

  • Hao Tan
  • Jun Li
  • Yizhuang Zhou
  • Jun Wan
  • Zhen Lei
  • Xiangyu Zhang

Vision-Language Models (VLMs) such as CLIP have demonstrated remarkable generalization capabilities to downstream tasks. However, existing prompt tuning based frameworks need to parallelize learnable textual inputs for all categories, suffering from massive GPU memory consumption when there is a large number of categories in the target dataset. Moreover, previous works require category names to be included within prompts, exhibiting subpar performance when dealing with ambiguous category names. To address these shortcomings, we propose Compound Text-Guided Prompt Tuning (TGP-T) that significantly reduces resource demand while achieving superior performance. We introduce text supervision to the optimization of prompts, which enables two benefits: 1) releasing the model reliance on the pre-defined category names during inference, thereby enabling more flexible prompt generation; 2) reducing the number of inputs to the text encoder, which decreases GPU memory consumption significantly. Specifically, we found that compound text supervision, i.e., category-wise and content-wise, is highly effective, since the two types provide inter-class separability and capture intra-class variations, respectively. Moreover, we condition the prompt generation on visual features through a module called Bonder, which facilitates the alignment between prompts and visual features. Extensive experiments on few-shot recognition and domain generalization demonstrate that TGP-T achieves superior performance with consistently lower training costs. It reduces GPU memory usage by 93% and attains a 2.5% performance gain on 16-shot ImageNet. The code is available at https://github.com/EricTan7/TGP-T.

IJCAI Conference 2024 Conference Paper

Unified Physical-Digital Face Attack Detection

  • Hao Fang
  • Ajian Liu
  • Haocheng Yuan
  • Junze Zheng
  • Dingheng Zeng
  • Yanhong Liu
  • Jiankang Deng
  • Sergio Escalera

Face Recognition (FR) systems can suffer from physical (i.e., print photo) and digital (i.e., DeepFake) attacks. However, previous related work rarely considers both situations at the same time. This implies the deployment of multiple models and thus more computational burden. This lack of an integrated model stems from two factors: (1) the lack of a dataset including both physical and digital attacks in which the same ID covers the real face and all attack types; (2) given the large intra-class variance between these two attacks, it is difficult to learn a compact feature space to detect both attacks simultaneously. To address these issues, we collect a Unified physical-digital Attack dataset, called UniAttackData. The dataset consists of 1,800 participations of 2 and 12 physical and digital attacks, respectively, resulting in a total of 28,706 videos. Then, we propose a Unified Attack Detection framework based on Vision-Language Models (VLMs), namely UniAttackDetection, which includes three main modules: the Teacher-Student Prompts (TSP) module, focused on acquiring unified and specific knowledge respectively; the Unified Knowledge Mining (UKM) module, designed to capture a comprehensive feature space; and the Sample-Level Prompt Interaction (SLPI) module, aimed at grasping sample-level semantics. These three modules seamlessly form a robust unified attack detection framework. Extensive experiments on UniAttackData and three other datasets demonstrate the superiority of our approach for unified face attack detection. Dataset link: https://sites.google.com/view/face-anti-spoofing-challenge/dataset-download/uniattackdatacvpr2024

NeurIPS Conference 2024 Conference Paper

Voxel Mamba: Group-Free State Space Models for Point Cloud based 3D Object Detection

  • Guowen Zhang
  • Lue Fan
  • Chenhang He
  • Zhen Lei
  • Zhaoxiang Zhang
  • Lei Zhang

Serialization-based methods, which serialize the 3D voxels and group them into multiple sequences before inputting to Transformers, have demonstrated their effectiveness in 3D object detection. However, serializing 3D voxels into 1D sequences will inevitably sacrifice the voxel spatial proximity. Such an issue is hard to address by enlarging the group size with existing serialization-based methods, due to the quadratic complexity of Transformers with feature sizes. Inspired by the recent advances of state space models (SSMs), we present a Voxel SSM, termed Voxel Mamba, which employs a group-free strategy to serialize the whole space of voxels into a single sequence. The linear complexity of SSMs encourages our group-free design, alleviating the loss of spatial proximity of voxels. To further enhance the spatial proximity, we propose a Dual-scale SSM Block to establish a hierarchical structure, enabling a larger receptive field in the 1D serialization curve, as well as more complete local regions in 3D space. Moreover, we implicitly apply window partition under the group-free framework by positional encoding, which further enhances spatial proximity by encoding voxel positional information. Our experiments on the Waymo Open Dataset and nuScenes dataset show that Voxel Mamba not only achieves higher accuracy than state-of-the-art methods, but also demonstrates significant advantages in computational efficiency. The source code is available at https://github.com/gwenzhang/Voxel-Mamba.
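
Serializing every voxel in the scene into a single sequence requires an ordering over 3D coordinates. The sketch below uses Morton (Z-order) bit interleaving as a stand-in space-filling curve; the specific curve, the window-partition positional encoding, and the SSM blocks used by Voxel Mamba are not shown.

```python
import numpy as np

def morton_key(x, y, z, bits=10):
    """Interleave the bits of integer voxel coordinates into a single key so
    that sorting by the key gives a Z-order (Morton) space-filling traversal.
    Shown only to illustrate serializing all voxels into one sequence; the
    actual curve used by Voxel Mamba may differ."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (3 * i)
        key |= ((y >> i) & 1) << (3 * i + 1)
        key |= ((z >> i) & 1) << (3 * i + 2)
    return key

# Toy set of occupied voxel coordinates (x, y, z).
voxels = np.random.randint(0, 1024, size=(5000, 3))
keys = np.array([morton_key(x, y, z) for x, y, z in voxels])
order = np.argsort(keys)
sequence = voxels[order]        # one group-free sequence over the whole space
print(sequence[:5])
```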

AAAI Conference 2023 Conference Paper

Grouped Knowledge Distillation for Deep Face Recognition

  • Weisong Zhao
  • Xiangyu Zhu
  • Kaiwen Guo
  • Xiao-Yu Zhang
  • Zhen Lei

Compared with the feature-based distillation methods, logits distillation can liberalize the requirements of consistent feature dimension between teacher and student networks, while the performance is deemed inferior in face recognition. One major challenge is that the light-weight student network has difficulty fitting the target logits due to its low model capacity, which is attributed to the significant number of identities in face recognition. Therefore, we seek to probe the target logits to extract the primary knowledge related to face identity, and discard the others, to make the distillation more achievable for the student network. Specifically, there is a tail group with near-zero values in the prediction, containing minor knowledge for distillation. To provide a clear perspective of its impact, we first partition the logits into two groups, i.e., Primary Group and Secondary Group, according to the cumulative probability of the softened prediction. Then, we reorganize the Knowledge Distillation (KD) loss of grouped logits into three parts, i.e., Primary-KD, Secondary-KD, and Binary-KD. Primary-KD refers to distilling the primary knowledge from the teacher, Secondary-KD aims to refine minor knowledge but increases the difficulty of distillation, and Binary-KD ensures the consistency of knowledge distribution between teacher and student. We experimentally found that (1) Primary-KD and Binary-KD are indispensable for KD, and (2) Secondary-KD is the culprit restricting KD at the bottleneck. Therefore, we propose a Grouped Knowledge Distillation (GKD) that retains the Primary-KD and Binary-KD but omits Secondary-KD in the ultimate KD loss calculation. Extensive experimental results on popular face recognition benchmarks demonstrate the superiority of the proposed GKD over state-of-the-art methods.
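
The grouping step can be sketched directly from the description above: sort the teacher's softened prediction, take the classes covering most of the probability mass as the Primary Group, and distill the primary distribution plus the binary primary-vs-secondary split. The threshold and the exact loss form below are assumptions for illustration, not the paper's definition.

```python
import torch
import torch.nn.functional as F

def grouped_kd_loss(student_logits, teacher_logits, T=4.0, cum_thresh=0.9):
    """Sketch: split the teacher's softened prediction into a Primary Group
    (classes covering the top cum_thresh probability mass) and the remaining
    Secondary Group, then distill only the primary part plus the binary split."""
    t = F.softmax(teacher_logits / T, dim=-1)
    s = F.softmax(student_logits / T, dim=-1)
    t_sorted, idx = t.sort(dim=-1, descending=True)
    primary = t_sorted.cumsum(dim=-1) <= cum_thresh              # mask in sorted order
    primary[..., 0] = True                                       # keep at least the top class
    mask = torch.zeros_like(t).scatter(-1, idx, primary.float()).bool()

    eps = 1e-8
    # Primary-KD: KL divergence on the renormalized primary classes.
    t_p = (t * mask) / (t * mask).sum(-1, keepdim=True)
    s_p = (s * mask) / ((s * mask).sum(-1, keepdim=True) + eps)
    primary_kd = (t_p * ((t_p + eps).log() - (s_p + eps).log()))[mask].sum() / t.size(0)
    # Binary-KD: match the total mass assigned to the primary vs. secondary groups.
    t_b = torch.stack([(t * mask).sum(-1), (t * ~mask).sum(-1)], dim=-1)
    s_b = torch.stack([(s * mask).sum(-1), (s * ~mask).sum(-1)], dim=-1)
    binary_kd = F.kl_div((s_b + eps).log(), t_b, reduction="batchmean")
    return primary_kd + binary_kd

student = torch.randn(8, 1000, requires_grad=True)
teacher = torch.randn(8, 1000)
grouped_kd_loss(student, teacher).backward()
```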

AAAI Conference 2023 Conference Paper

Mixture Uniform Distribution Modeling and Asymmetric Mix Distillation for Class Incremental Learning

  • Sunyuan Qiang
  • Jiayi Hou
  • Jun Wan
  • Yanyan Liang
  • Zhen Lei
  • Du Zhang

Exemplar rehearsal-based methods with knowledge distillation (KD) have been widely used in class incremental learning (CIL) scenarios. However, they still suffer from performance degradation because of the severe distribution discrepancy between the training and test sets caused by the limited storage memory for previous classes. In this paper, we mathematically model the data distribution and the discrepancy at the incremental stages with a mixture uniform distribution (MUD). Then, we propose the asymmetric mix distillation method to uniformly minimize the error of each class from a distribution discrepancy perspective. Specifically, we first promote mixup in CIL scenarios with the incremental mix samplers and incremental mix factor to calibrate the raw training data distribution. Next, mix distillation label augmentation is incorporated into the data distribution to inherit the knowledge information from the previous models. Based on the above augmented data distribution, our trained model effectively alleviates the performance degradation, and extensive experimental results validate that our method exhibits superior performance on CIL benchmarks.
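
The data-calibration component builds on mixup, which interpolates both inputs and soft labels. The snippet below shows plain mixup only; the incremental mix samplers, incremental mix factor, and asymmetric distillation terms of the paper are not modelled.

```python
import numpy as np
import torch
import torch.nn.functional as F

def mixup_batch(x, y, num_classes, alpha=1.0):
    """Plain mixup: convex combination of inputs and of one-hot labels,
    with the mixing coefficient drawn from a Beta distribution."""
    lam = np.random.beta(alpha, alpha)
    perm = torch.randperm(x.size(0))
    x_mix = lam * x + (1 - lam) * x[perm]
    y_onehot = F.one_hot(y, num_classes).float()
    y_mix = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return x_mix, y_mix

x = torch.randn(16, 3, 32, 32)
y = torch.randint(0, 10, (16,))
x_mix, y_mix = mixup_batch(x, y, num_classes=10)
print(x_mix.shape, y_mix.shape)   # torch.Size([16, 3, 32, 32]) torch.Size([16, 10])
```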

AAAI Conference 2022 Conference Paper

Deconfounding Physical Dynamics with Global Causal Relation and Confounder Transmission for Counterfactual Prediction

  • Zongzhao Li
  • Xiangyu Zhu
  • Zhen Lei
  • Zhaoxiang Zhang

Discovering the underlying causal relations is a fundamental ability for reasoning about the surrounding environment and predicting future states in the physical world. Counterfactual prediction from visual input, which requires simulating future states based on unrealized situations in the past, is a vital component in causal relation tasks. In this paper, we work on the confounders that have an effect on the physical dynamics, including masses, friction coefficients, etc., to bridge relations between the intervened variable and the affected variable whose future state may be altered. We propose a neural network framework combining Global Causal Relation Attention (GCRA) and Confounder Transmission Structure (CTS). The GCRA looks for the latent causal relations between different variables and estimates the confounders by capturing both spatial and temporal information. The CTS integrates and transmits the learnt confounders in a residual way, so that the estimated confounders can be encoded into the network as a constraint for object positions when performing counterfactual prediction. Without any access to ground truth information about confounders, our model outperforms the state-of-the-art method on various benchmarks by fully utilizing the constraints of confounders. Extensive experiments demonstrate that our model can generalize to unseen environments and maintain good performance.

AAAI Conference 2021 Conference Paper

Searching for Alignment in Face Recognition

  • Xiaqing Xu
  • Qiang Meng
  • Yunxiao Qin
  • Jianzhu Guo
  • Chenxu Zhao
  • Feng Zhou
  • Zhen Lei

A standard pipeline of current face recognition frameworks consists of four individual steps: locating a face with a rough bounding box and several fiducial landmarks, aligning the face image using a pre-defined template, extracting representations, and comparing them. Among them, face detection, landmark detection and representation learning have long been studied and many works have been proposed. The alignment step, although essential and with a significant impact on recognition performance, has attracted little attention. In this paper, we first explore and highlight the effects of different alignment templates on face recognition. Then, for the first time, we try to search for the optimal template automatically. We construct a well-defined search space by decomposing the template search into crop size and vertical shift, and propose an efficient method, Face Alignment Policy Search (FAPS). Besides, a well-designed benchmark is proposed to evaluate the searched policy. Experiments on our proposed benchmark validate the effectiveness of our method in improving face recognition performance.
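
With the search space decomposed into crop size and vertical shift, even a brute-force loop makes the idea concrete. The sketch below assumes a user-supplied evaluate callback that measures verification accuracy for a candidate template; FAPS itself searches this space far more efficiently than exhaustive enumeration.

```python
import itertools

def search_alignment_template(evaluate, crop_sizes, vertical_shifts):
    """Brute-force sweep over the two-factor search space (crop size x vertical
    shift). `evaluate` is a user-supplied callback that aligns a validation set
    with the candidate template and returns a score such as verification accuracy."""
    best, best_score = None, float("-inf")
    for crop, shift in itertools.product(crop_sizes, vertical_shifts):
        score = evaluate(crop_size=crop, vertical_shift=shift)
        if score > best_score:
            best, best_score = (crop, shift), score
    return best, best_score

# Example with a dummy objective standing in for verification accuracy.
best, score = search_alignment_template(
    evaluate=lambda crop_size, vertical_shift: -abs(crop_size - 112) - abs(vertical_shift),
    crop_sizes=[96, 112, 128],
    vertical_shifts=[-8, 0, 8],
)
print(best, score)   # (112, 0) 0
```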

AAAI Conference 2020 Conference Paper

Learning Meta Model for Zero- and Few-Shot Face Anti-Spoofing

  • Yunxiao Qin
  • Chenxu Zhao
  • Xiangyu Zhu
  • Zezheng Wang
  • Zitong Yu
  • Tianyu Fu
  • Feng Zhou
  • Jingping Shi

Face anti-spoofing is crucial to the security of face recognition systems. Most previous methods formulate face anti-spoofing as a supervised learning problem to detect various predefined presentation attacks, which needs large-scale training data to cover as many attacks as possible. However, the trained model easily overfits to several common attacks and is still vulnerable to unseen attacks. To overcome this challenge, the detector should: 1) learn discriminative features that can generalize to unseen spoofing types from predefined presentation attacks; 2) quickly adapt to new spoofing types by learning from both the predefined attacks and a few examples of the new spoofing types. Therefore, we define face anti-spoofing as a zero- and few-shot learning problem. In this paper, we propose a novel Adaptive Inner-update Meta Face Anti-Spoofing (AIM-FAS) method to tackle this problem through meta-learning. Specifically, AIM-FAS trains a meta-learner focusing on the task of detecting unseen spoofing types by learning from predefined living and spoofing faces and a few examples of new attacks. To assess the proposed approach, we propose several benchmarks for zero- and few-shot FAS. Experiments show its superior performance over existing methods on the presented benchmarks and existing zero-shot FAS protocols.
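
The inner-update idea is in the spirit of MAML: adapt a copy of the parameters on a few support examples of a new spoof type, then take the meta-gradient on query examples. The sketch below is generic MAML-style code (it assumes PyTorch 2.x for torch.func.functional_call), not the AIM-FAS training procedure.

```python
import torch
import torch.nn as nn

# Minimal inner-update step: adapt on support data, evaluate on query data,
# and backpropagate through the adaptation. Generic meta-learning sketch only.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 2))
loss_fn = nn.CrossEntropyLoss()

support_x, support_y = torch.randn(5, 128), torch.randint(0, 2, (5,))
query_x, query_y = torch.randn(20, 128), torch.randint(0, 2, (20,))

params = {n: p for n, p in model.named_parameters()}
inner_loss = loss_fn(torch.func.functional_call(model, params, (support_x,)), support_y)
grads = torch.autograd.grad(inner_loss, list(params.values()), create_graph=True)
adapted = {n: p - 0.01 * g for (n, p), g in zip(params.items(), grads)}

outer_loss = loss_fn(torch.func.functional_call(model, adapted, (query_x,)), query_y)
outer_loss.backward()   # meta-gradient flows through the inner update
```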

AAAI Conference 2020 Conference Paper

PedHunter: Occlusion Robust Pedestrian Detector in Crowded Scenes

  • Cheng Chi
  • Shifeng Zhang
  • Junliang Xing
  • Zhen Lei
  • Stan Z. Li
  • Xudong Zou

Pedestrian detection in crowded scenes is a challenging problem, because occlusion happens frequently among different pedestrians. In this paper, we propose an effective and efficient detection network to hunt pedestrians in crowded scenes. The proposed method, namely PedHunter, introduces strong occlusion handling ability to existing region-based detection networks without bringing extra computations in the inference stage. Specifically, we design a mask-guided module to leverage the head information to enhance the feature representation learning of the backbone network. Moreover, we develop a strict classification criterion by improving the quality of positive samples during training to eliminate common false positives of pedestrian detection in crowded scenes. Besides, we present an occlusion-simulated data augmentation to enrich the pattern and quantity of occlusion samples to improve the occlusion robustness. As a consequence, we achieve state-of-the-art results on three pedestrian detection datasets including CityPersons, Caltech-USA and CrowdHuman. To facilitate further studies on occluded pedestrian detection in surveillance scenes, we release a new pedestrian dataset, called SUR-PED, with a total of over 162k high-quality manually labeled instances in 10k images. The proposed dataset, source codes and trained models are available at https://github.com/ChiCheng123/PedHunter.

AAAI Conference 2020 Conference Paper

Relational Learning for Joint Head and Human Detection

  • Cheng Chi
  • Shifeng Zhang
  • Junliang Xing
  • Zhen Lei
  • Stan Z. Li
  • Xudong Zou

Head and human detection have been rapidly improved with the development of deep convolutional neural networks. However, these two tasks are often studied separately without considering their inherent correlation, with the result that 1) head detection often suffers from more false positives, and 2) the performance of human detectors frequently drops dramatically in crowded scenes. To handle these two issues, we present a novel joint head and human detection network, namely JointDet, which effectively detects head and human body simultaneously. Moreover, we design a head-body relationship discriminating module to perform relational learning between heads and human bodies, and leverage this learned relationship to regain the suppressed human detections and reduce head false positives. To verify the effectiveness of the proposed method, we annotate head bounding boxes of the CityPersons and Caltech-USA datasets, and conduct extensive experiments on the CrowdHuman, CityPersons and Caltech-USA datasets. As a consequence, the proposed JointDet detector achieves state-of-the-art performance on these three benchmarks. To facilitate further studies on the head and human detection problem, all new annotations, source codes and trained models are available at https://github.com/ChiCheng123/JointDet.

AAAI Conference 2019 Conference Paper

Selective Refinement Network for High Performance Face Detection

  • Cheng Chi
  • Shifeng Zhang
  • Junliang Xing
  • Zhen Lei
  • Stan Z. Li
  • Xudong Zou

High performance face detection remains a very challenging problem, especially when there exist many tiny faces. This paper presents a novel single-shot face detector, named Selective Refinement Network (SRN), which selectively introduces novel two-step classification and regression operations into an anchor-based face detector to reduce false positives and improve location accuracy simultaneously. In particular, the SRN consists of two modules: the Selective Two-step Classification (STC) module and the Selective Two-step Regression (STR) module. The STC aims to filter out most simple negative anchors from low-level detection layers to reduce the search space for the subsequent classifier, while the STR is designed to coarsely adjust the locations and sizes of anchors from high-level detection layers to provide better initialization for the subsequent regressor. Moreover, we design a Receptive Field Enhancement (RFE) block to provide more diverse receptive fields, which helps to better capture faces in some extreme poses. As a consequence, the proposed SRN detector achieves state-of-the-art performance on all the widely used face detection benchmarks, including AFW, PASCAL face, FDDB, and WIDER FACE datasets. Codes will be released to facilitate further studies on the face detection problem.

IJCAI Conference 2018 Conference Paper

Ensemble Soft-Margin Softmax Loss for Image Classification

  • Xiaobo Wang
  • Shifeng Zhang
  • Zhen Lei
  • Si Liu
  • Xiaojie Guo
  • Stan Z. Li

Softmax loss is arguably one of the most popular losses to train CNN models for image classification. However, recent works have exposed its limitation on feature discriminability. This paper casts a new viewpoint on the weakness of softmax loss. On the one hand, the CNN features learned using the softmax loss are often inadequately discriminative. We hence introduce a soft-margin softmax function to explicitly encourage the discrimination between different classes. On the other hand, the learned classifier of softmax loss is weak. We propose to assemble multiple such weak classifiers into a strong one, inspired by the recognition that the diversity among weak classifiers is critical to a good ensemble. To achieve the diversity, we adopt the Hilbert-Schmidt Independence Criterion (HSIC). Considering these two aspects in one framework, we design a novel loss, named Ensemble Soft-Margin Softmax (EM-Softmax). Extensive experiments on benchmark datasets are conducted to show the superiority of our design over the baseline softmax loss and several state-of-the-art alternatives.
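
The soft-margin part can be illustrated by subtracting an additive margin from the target logit before the softmax, so the target class must win by at least that margin. The sketch below shows only this term; the ensemble of classifiers and the HSIC diversity penalty are omitted, and the margin value is an assumption.

```python
import torch
import torch.nn.functional as F

def soft_margin_softmax_loss(logits, labels, margin=0.5):
    """Softmax cross-entropy with an additive margin subtracted from the
    target logit, which forces the target class to win by at least `margin`."""
    adjusted = logits.clone()
    adjusted[torch.arange(logits.size(0)), labels] -= margin
    return F.cross_entropy(adjusted, labels)

logits = torch.randn(8, 100, requires_grad=True)
labels = torch.randint(0, 100, (8,))
soft_margin_softmax_loss(logits, labels).backward()
```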

IS Journal 2018 Journal Article

Trends and Controversies

  • Hugo Proenca
  • Mark Nixon
  • Michele Nappi
  • Esam Ghaleb
  • Gokhan Ozbulak
  • Hua Gao
  • Hazim Kemal Ekenel
  • Klemen Grm

Performing covert biometric recognition in surveillance environments has been regarded as a grand challenge, considering the adversity of the conditions where recognition should be carried out (e.g., poor resolution, bad lighting, off-pose and partially occluded data). This special issue compiles a group of approaches to this problem.

AAAI Conference 2016 Conference Paper

Large Scale Similarity Learning Using Similar Pairs for Person Verification

  • Yang Yang
  • Shengcai Liao
  • Zhen Lei
  • Stan Li

In this paper, we propose a novel similarity measure and then introduce an efficient strategy to learn it by using only similar pairs for person verification. Unlike existing metric learning methods, we consider both the difference and commonness of an image pair to increase its discriminativeness. Under a pair-constrained Gaussian assumption, we show how to obtain the Gaussian priors (i.e., corresponding covariance matrices) of dissimilar pairs from those of similar pairs. The application of a log likelihood ratio makes the learning process simple and fast and thus scalable to large datasets. Additionally, our method is able to handle heterogeneous data well. Results on the challenging datasets of face verification (LFW and PubFig) and person re-identification (VIPeR) show that our algorithm outperforms the state-of-the-art methods.
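
The verification score in this family of methods is a log-likelihood ratio of a pair under two Gaussian models. The sketch below scores the pair difference under generic "similar" and "dissimilar" covariances estimated from toy data; deriving the dissimilar-pair statistics from the similar pairs, as the paper does, is not reproduced here.

```python
import numpy as np

def llr_score(x1, x2, cov_similar, cov_dissimilar):
    """Log-likelihood ratio of a pair difference under two zero-mean Gaussian
    models (similar vs. dissimilar); higher means more likely the same person."""
    d = (x1 - x2).reshape(-1, 1)
    def log_gauss(cov):
        sign, logdet = np.linalg.slogdet(cov)
        return -0.5 * (d.T @ np.linalg.solve(cov, d)).item() - 0.5 * logdet
    return log_gauss(cov_similar) - log_gauss(cov_dissimilar)

rng = np.random.default_rng(0)
dim = 16
feats = rng.normal(size=(200, dim))
# Covariance of differences between similar pairs (toy: small perturbations).
sim_diffs = rng.normal(scale=0.1, size=(500, dim))
cov_s = np.cov(sim_diffs, rowvar=False) + 1e-6 * np.eye(dim)
cov_d = np.cov(feats, rowvar=False) * 2 + 1e-6 * np.eye(dim)   # stand-in dissimilar model
# A near-duplicate pair should score higher than a random pair (prints True).
print(llr_score(feats[0], feats[0] + 0.05, cov_s, cov_d) >
      llr_score(feats[0], feats[1], cov_s, cov_d))
```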

AAAI Conference 2016 Conference Paper

Metric Embedded Discriminative Vocabulary Learning for High-Level Person Representation

  • Yang Yang
  • Zhen Lei
  • Shifeng Zhang
  • Hailin Shi
  • Stan Li

A variety of encoding methods for the bag-of-words (BoW) model have been proposed to encode the local features in image classification. However, most of them are unsupervised and simply employ k-means to form the visual vocabulary, thus reducing the discriminative power of the features. In this paper, we propose a metric embedded discriminative vocabulary learning method for high-level person representation with application to person re-identification. A new and effective term is introduced which aims at making the same persons closer while pushing different ones farther apart in the metric space. With the learned vocabulary, we utilize a linear coding method to encode the image-level features (or holistic image features) for extracting high-level person representations. Different from traditional unsupervised approaches, our method can explore the relationship (same or not) among the persons. Since there is an analytic solution to the linear coding, it is easy to obtain the final high-level features. The experimental results on person re-identification demonstrate the effectiveness of our proposed algorithm.