Arrow Research — Search

Author name cluster

Errui Ding

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

32 papers
2 author rows

Possible papers (32)

ICML 2025 · Conference Paper

Explanatory Instructions: Towards Unified Vision Tasks Understanding and Zero-shot Generalization

  • Yang Shen 0006
  • Xiu-Shen Wei
  • Yifan Sun 0003
  • Yuxin Song
  • Tao Yuan
  • Jian Jin
  • He-Yang Xu
  • Yazhou Yao

Computer Vision (CV) has yet to fully achieve the zero-shot task generalization observed in Natural Language Processing (NLP), despite following many of the milestones established in NLP, such as large transformer models, extensive pre-training, and the auto-regression paradigm, among others. In this paper, we rethink the reality that CV adopts discrete and terminological task definitions (e.g., "image segmentation"), and conjecture it is a key barrier that hampers zero-shot task generalization. Our hypothesis is that without truly understanding previously-seen tasks—due to these terminological definitions—deep models struggle to generalize to novel tasks. To verify this, we introduce Explanatory Instructions, which provide an intuitive way to define CV task objectives through detailed linguistic transformations from input images to outputs. We create a large-scale dataset comprising 12 million "image input → explanatory instruction → output" triplets, and train an auto-regressive-based vision-language model (AR-based VLM) that takes both images and explanatory instructions as input. By learning to follow these instructions, the AR-based VLM achieves instruction-level zero-shot capabilities for previously-seen tasks and demonstrates strong zero-shot generalization for unseen CV tasks. Code and dataset will be open-sourced.
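
A rough sketch of what one such triplet might look like as a record — field names here are hypothetical, not keys from the released dataset:

```python
from dataclasses import dataclass

@dataclass
class ExplanatoryTriplet:
    """One "image input -> explanatory instruction -> output" example.

    Field names are illustrative; the released dataset may use other keys.
    """
    input_image_path: str   # source image
    instruction: str        # linguistic transformation from input to output
    output_image_path: str  # target image (or text, for text-output tasks)

example = ExplanatoryTriplet(
    input_image_path="scene.jpg",
    instruction="Partition the photo into regions so that every distinct "
                "object is filled with one uniform color and boundaries "
                "between objects stay visible.",
    output_image_path="scene_segmented.png",
)
```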

AAAI 2025 · Conference Paper

Interpretable Face Anti-Spoofing: Enhancing Generalization with Multimodal Large Language Models

  • Guosheng Zhang
  • Keyao Wang
  • Haixiao Yue
  • Ajian Liu
  • Gang Zhang
  • Kun Yao
  • Errui Ding
  • Jingdong Wang

Face Anti-Spoofing (FAS) is essential for ensuring the security and reliability of facial recognition systems. Most existing FAS methods are formulated as binary classification tasks, providing confidence scores without interpretation. They exhibit limited generalization in out-of-domain scenarios, such as new environments or unseen spoofing types. In this work, we introduce a multimodal large language model (MLLM) framework for FAS, termed Interpretable Face Anti-Spoofing (I-FAS), which transforms the FAS task into an interpretable visual question answering (VQA) paradigm. Specifically, we propose a Spoof-aware Captioning and Filtering (SCF) strategy to generate high-quality captions for FAS images, enriching the model's supervision with natural language interpretations. To mitigate the impact of noisy captions during training, we develop a Lopsided Language Model (L-LM) loss function that separates loss calculations for judgment and interpretation, prioritizing the optimization of the former. Furthermore, to enhance the model's perception of global visual features, we design a Globally Aware Connector (GAC) to align multi-level visual representations with the language model. Extensive experiments on standard and newly devised One to Eleven cross-domain benchmarks, comprising 12 public datasets, demonstrate that our method significantly outperforms state-of-the-art methods.
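
A minimal sketch of the lopsided-loss idea described above: split the token-level language-model loss into a judgment part and an interpretation part, and down-weight the latter. The mask construction and the 0.3 weight are illustrative assumptions, not the paper's published design.

```python
import torch.nn.functional as F

def lopsided_lm_loss(logits, targets, judgment_mask, interp_weight=0.3):
    # logits: (B, T, V); targets: (B, T); judgment_mask: (B, T) bool,
    # True where the target token belongs to the real/spoof judgment span.
    per_token = F.cross_entropy(
        logits.flatten(0, 1), targets.flatten(0, 1), reduction="none"
    ).view(targets.shape)
    judgment_loss = per_token[judgment_mask].mean()   # prioritized term
    interp_loss = per_token[~judgment_mask].mean()    # down-weighted term
    return judgment_loss + interp_weight * interp_loss
```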

ICLR 2025 · Conference Paper

MGMapNet: Multi-Granularity Representation Learning for End-to-End Vectorized HD Map Construction

  • Jing Yang
  • Minyue Jiang
  • Sen Yang
  • Xiao Tan 0001
  • Yingying Li
  • Errui Ding
  • Jingdong Wang 0001
  • Hanli Wang

The construction of vectorized high-definition maps typically requires capturing both category and geometry information of map elements. Current state-of-the-art methods often adopt solely either point-level or instance-level representation, overlooking the strong intrinsic relationship between points and instances. In this work, we propose a simple yet efficient framework named MGMapNet (multi-granularity map network) to model map elements with multi-granularity representation, integrating both coarse-grained instance-level and fine-grained point-level queries. Specifically, these two granularities of queries are generated from the multi-scale bird's eye view features using a proposed multi-granularity aggregator. In this module, the instance-level query aggregates features over the entire scope covered by an instance, while the point-level query aggregates features locally. Furthermore, a point-instance interaction module is designed to encourage information exchange between instance-level and point-level queries. Experimental results demonstrate that the proposed MGMapNet achieves state-of-the-art performance, surpassing MapTRv2 by 5.3 mAP on the nuScenes dataset and 4.4 mAP on the Argoverse2 dataset.

ICLR 2025 · Conference Paper

Uni2Det: Unified and Universal Framework for Prompt-Guided Multi-dataset 3D Detection

  • Yubin Wang
  • Zhikang Zou
  • Xiaoqing Ye
  • Xiao Tan 0001
  • Errui Ding
  • Cairong Zhao

We present Uni2Det, a brand-new framework for unified and universal multi-dataset training on 3D detection, enabling robust performance across diverse domains and generalization to unseen domains. Due to substantial disparities in data distribution and variations in taxonomy across diverse domains, training such a detector by simply merging datasets poses a significant challenge. Motivated by this observation, we introduce multi-stage prompting modules for multi-dataset 3D detection, which leverage prompts based on the characteristics of the corresponding datasets to mitigate existing differences. This elegant design facilitates seamless plug-and-play integration within various advanced 3D detection frameworks in a unified manner, while also allowing straightforward adaptation for universal applicability across datasets. Experiments are conducted across multiple dataset consolidation scenarios involving KITTI, Waymo, and nuScenes, demonstrating that our Uni2Det outperforms existing methods by a large margin in multi-dataset training. Notably, results on zero-shot cross-dataset transfer validate the generalization capability of our proposed method. Our code is available at https://github.com/ThomasWangY/Uni2Det.

TMLR 2024 · Journal Article

MaskOCR: Scene Text Recognition with Masked Vision-Language Pre-training

  • Pengyuan Lyu
  • Chengquan Zhang
  • Shanshan Liu
  • Meina Qiao
  • Yangliu Xu
  • Liang Wu
  • Kun Yao
  • Junyu Han

Text images contain both visual and linguistic information. However, existing pre-training techniques for text recognition mainly focus on either visual representation learning or linguistic knowledge learning. In this paper, we propose a novel approach to unify vision and language pre-training in the classical encoder-decoder recognition framework. We adopt the masked image modeling approach to pre-train the feature encoder using a large set of unlabeled real text images, which allows us to learn strong visual representations. In contrast to introducing linguistic knowledge with an additional language model, we directly pre-train the sequence decoder. Specifically, we transform text data into synthesized text images to unify the data modalities of vision and language, and enhance the language modeling capability of the sequence decoder using a proposed masked image-language modeling scheme. Significantly, the encoder is frozen during the pre-training phase of the sequence decoder. Experimental results demonstrate that our proposed method achieves superior performance on benchmark datasets, including Chinese and English text images. The code for our approach will be made available.

AAAI 2024 · Conference Paper

Multi-Domain Incremental Learning for Face Presentation Attack Detection

  • Keyao Wang
  • Guosheng Zhang
  • Haixiao Yue
  • Ajian Liu
  • Gang Zhang
  • Haocheng Feng
  • Junyu Han
  • Errui Ding

Previous face Presentation Attack Detection (PAD) methods aim to improve the effectiveness of cross-domain tasks. However, in real-world scenarios, the original training data of the pre-trained model is not available due to data privacy or other reasons. Under these constraints, general methods for fine-tuning on single-target domain data may lose previously learned knowledge, leading to a catastrophic forgetting problem. To address these issues, we propose a multi-domain incremental learning (MDIL) method for PAD, which not only learns knowledge well from the new domain but also stably maintains the performance on previous domains. Specifically, we propose an adaptive domain-specific experts (ADE) framework based on the vision transformer to preserve the discriminability of previous domains. Furthermore, an asymmetric classifier is designed to keep the output distribution of different classifiers consistent, thereby improving the generalization ability. Extensive experiments show that our proposed method achieves state-of-the-art performance compared to prior incremental learning methods. Notably, under more stringent settings, our method matches or even outperforms the DA/DG-based methods.

NeurIPS 2024 · Conference Paper

Octopus: A Multi-modal LLM with Parallel Recognition and Sequential Understanding

  • Chuyang Zhao
  • YuXin Song
  • Junru Chen
  • Kang Rong
  • Haocheng Feng
  • Gang Zhang
  • Shufan Ji
  • Jingdong Wang

Mainstream Multi-modal Large Language Models (MLLMs) have two essential functions, i.e., visual recognition (e.g., grounding) and understanding (e.g., visual question answering). Presently, all these MLLMs integrate visual recognition and understanding in the same sequential manner in the LLM head, i.e., generating the response token-by-token for both recognition and understanding. We think unifying them in the same sequential manner is not optimal for two reasons: 1) parallel recognition is more efficient than sequential recognition and is actually prevailing in deep visual recognition, and 2) the recognition results can be integrated to help high-level cognition (while the current manner does not). Thus motivated, this paper proposes a novel “parallel recognition → sequential understanding” framework for MLLMs. The bottom LLM layers are utilized for parallel recognition and the recognition results are relayed into the top LLM layers for sequential understanding. Specifically, parallel recognition in the bottom LLM layers is implemented via object queries, a popular mechanism in DEtection TRansformer, which we find to harmonize well with the LLM layers. Empirical studies show that our MLLM, named Octopus, improves accuracy on popular MLLM tasks and is up to 5× faster on visual grounding tasks.

NeurIPS 2024 · Conference Paper

OpenGaussian: Towards Point-Level 3D Gaussian-based Open Vocabulary Understanding

  • Yanmin Wu
  • Jiarui Meng
  • Haijie Li
  • Chenming Wu
  • Yahao Shi
  • Xinhua Cheng
  • Chen Zhao
  • Haocheng Feng

This paper introduces OpenGaussian, a method based on 3D Gaussian Splatting (3DGS) that possesses the capability for 3D point-level open vocabulary understanding. Our primary motivation stems from observing that existing 3DGS-based open vocabulary methods mainly focus on 2D pixel-level parsing. These methods struggle with 3D point-level tasks due to weak feature expressiveness and inaccurate 2D-3D feature associations. To ensure robust feature presentation and 3D point-level understanding, we first employ SAM masks without cross-frame associations to train instance features with 3D consistency. These features exhibit both intra-object consistency and inter-object distinction. Then, we propose a two-stage codebook to discretize these features from coarse to fine levels. At the coarse level, we consider the positional information of 3D points to achieve location-based clustering, which is then refined at the fine level. Finally, we introduce an instance-level 3D-2D feature association method that links 3D points to 2D masks, which are further associated with 2D CLIP features. Extensive experiments, including open vocabulary-based 3D object selection, 3D point cloud understanding, click-based 3D object selection, and ablation studies, demonstrate the effectiveness of our proposed method. The source code is available at our project page https://3d-aigc.github.io/OpenGaussian.

NeurIPS 2024 · Conference Paper

ShowMaker: Creating High-Fidelity 2D Human Video via Fine-Grained Diffusion Modeling

  • Quanwei Yang
  • Jiazhi Guan
  • Kaisiyuan Wang
  • Lingyun Yu
  • Wenqing Chu
  • Hang Zhou
  • Zhiqiang Feng
  • Haocheng Feng

Although significant progress has been made in human video generation, most previous studies focus on either human facial animation or full-body animation, which cannot be directly applied to produce realistic conversational human videos with frequent hand gestures and various facial movements simultaneously. To address these limitations, we propose a 2D human video generation framework, named ShowMaker, capable of generating high-fidelity half-body conversational videos via fine-grained diffusion modeling. We leverage dual-stream diffusion models as the backbone of our framework and carefully design two novel components for crucial local regions (i.e., hands and face) that can be easily integrated into our backbone. Specifically, to handle the challenging hand generation caused by sparse motion guidance, we propose a novel Key Point-based Fine-grained Hand Modeling module by amplifying positional information from raw hand key points and constructing a corresponding key point-based codebook. Moreover, to restore richer facial details in generated results, we introduce a Face Recapture module, which extracts facial texture features and global identity features from the aligned human face and integrates them into the diffusion process for face enhancement. Extensive quantitative and qualitative experiments demonstrate the superior visual quality and temporal consistency of our method.

ICML 2024 · Conference Paper

Towards Unified Multi-granularity Text Detection with Interactive Attention

  • Xingyu Wan
  • Chengquan Zhang
  • Pengyuan Lyu
  • Sen Fan
  • Zihan Ni
  • Kun Yao
  • Errui Ding
  • Jingdong Wang 0001

Existing OCR engines or document image analysis systems typically rely on training separate models for text detection in varying scenarios and granularities, leading to significant computational complexity and resource demands. In this paper, we introduce "Detect Any Text" (DAT), an advanced paradigm that seamlessly unifies scene text detection, layout analysis, and document page detection into a cohesive, end-to-end model. This design enables DAT to efficiently manage text instances at different granularities, including word, line, paragraph and page. A pivotal innovation in DAT is the across-granularity interactive attention module, which significantly enhances the representation learning of text instances at varying granularities by correlating structural information across different text queries. As a result, it enables the model to achieve mutually beneficial detection performance across multiple text granularities. Additionally, a prompt-based segmentation module refines detection outcomes for texts of arbitrary curvature and complex layouts, thereby improving DAT’s accuracy and expanding its real-world applicability. Experimental results demonstrate that DAT achieves state-of-the-art performance across a variety of text-related benchmarks, including multi-oriented/arbitrarily-shaped scene text detection, document layout analysis and page detection tasks.

TMLR 2023 · Journal Article

CAE v2: Context Autoencoder with CLIP Latent Alignment

  • Xinyu Zhang
  • Jiahui Chen
  • Junkun Yuan
  • Qiang Chen
  • Jian Wang
  • Xiaodi Wang
  • Shumin Han
  • Xiaokang Chen

Masked image modeling (MIM) learns visual representations by predicting the masked patches on a pre-defined target. Inspired by MVP (Wei et al., 2022b), which displays impressive gains with CLIP, in this work we also employ the semantically rich CLIP latent as the target and further tap its potential by introducing a new MIM pipeline, CAE v2, to learn a high-quality encoder and facilitate model convergence on the pre-training task. CAE v2 is an improved variant of CAE (Chen et al., 2023), applying the CLIP latent on two pre-training tasks, i.e., visible latent alignment and masked latent alignment. Visible latent alignment directly mimics the visible latent representations from the encoder to the corresponding CLIP latent, which is beneficial for facilitating model convergence and improving the representative ability of the encoder. Masked latent alignment predicts the representations of masked patches within the feature space of the CLIP latent, as the standard MIM task does, effectively aligning the representations computed from the encoder and the regressor into the same domain. We pre-train CAE v2 on ImageNet-1K images and evaluate on various downstream vision tasks, including image classification, semantic segmentation, object detection and instance segmentation. Experiments show that our CAE v2 achieves competitive performance and even outperforms the CLIP vision encoder, demonstrating the effectiveness of our method. Code is available at https://github.com/Atten4Vis/CAE.
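
Read literally, the two objectives are alignment losses against CLIP latents; a toy sketch assuming plain MSE targets (the actual loss form and weighting may differ):

```python
import torch.nn.functional as F

def cae_v2_losses(enc_visible, reg_masked, clip_visible, clip_masked):
    # enc_visible: encoder outputs on visible patches, (N_vis, D)
    # reg_masked:  regressor predictions for masked patches, (N_mask, D)
    # clip_*:      CLIP latents of the corresponding patches
    loss_visible = F.mse_loss(enc_visible, clip_visible)  # visible latent alignment
    loss_masked = F.mse_loss(reg_masked, clip_masked)     # masked latent alignment
    return loss_visible + loss_masked
```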

AAAI 2023 · Conference Paper

Cyclically Disentangled Feature Translation for Face Anti-spoofing

  • Haixiao Yue
  • Keyao Wang
  • Guosheng Zhang
  • Haocheng Feng
  • Junyu Han
  • Errui Ding
  • Jingdong Wang

Current domain adaptation methods for face anti-spoofing leverage labeled source domain data and unlabeled target domain data to obtain a promising generalizable decision boundary. However, it is usually difficult for these methods to achieve a perfect domain-invariant liveness feature disentanglement, which may degrade the final classification performance by domain differences in illumination, face category, spoof type, etc. In this work, we tackle cross-scenario face anti-spoofing by proposing a novel domain adaptation method called cyclically disentangled feature translation network (CDFTN). Specifically, CDFTN generates pseudo-labeled samples that possess: 1) source domain-invariant liveness features and 2) target domain-specific content features, which are disentangled through domain adversarial training. A robust classifier is trained based on the synthetic pseudo-labeled images under the supervision of source domain labels. We further extend CDFTN for multi-target domain adaptation by leveraging data from more unlabeled target domains. Extensive experiments on several public datasets demonstrate that our proposed approach significantly outperforms the state of the art. Code and models are available at https://github.com/vis-face/CDFTN.

ICLR 2023 · Conference Paper

Graph Contrastive Learning for Skeleton-based Action Recognition

  • Xiaohu Huang
  • Hao Zhou 0039
  • Jian Wang 0066
  • Haocheng Feng
  • Junyu Han
  • Errui Ding
  • Jingdong Wang 0001
  • Xinggang Wang

In the field of skeleton-based action recognition, current top-performing graph convolutional networks (GCNs) exploit intra-sequence context to construct adaptive graphs for feature aggregation. However, we argue that such context is still local, since the rich cross-sequence relations have not been explicitly investigated. In this paper, we propose a graph contrastive learning framework for skeleton-based action recognition (SkeletonGCL) to explore the global context across all sequences. Specifically, SkeletonGCL associates graph learning across sequences by enforcing graphs to be class-discriminative, i.e., intra-class compact and inter-class dispersed, which improves the GCN capacity to distinguish various action patterns. Besides, two memory banks are designed to enrich cross-sequence context from two complementary levels, i.e., instance and semantic levels, enabling graph contrastive learning at multiple context scales. Consequently, SkeletonGCL establishes a new training paradigm, and it can be seamlessly incorporated into current GCNs. Without loss of generality, we combine SkeletonGCL with three GCNs (2S-AGCN, CTR-GCN, and InfoGCN), and achieve consistent improvements on the NTU60, NTU120, and NW-UCLA benchmarks.
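
One way to picture the class-discriminative contrast with a memory bank is a multi-positive InfoNCE over graph embeddings; the temperature, normalization, and bank layout below are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def graph_contrastive_loss(graph_emb, labels, bank_emb, bank_labels, tau=0.1):
    # graph_emb: (B, D) graph embeddings of the current batch, labels: (B,)
    # bank_emb:  (M, D) memory-bank entries (instance or semantic level)
    q = F.normalize(graph_emb, dim=1)
    k = F.normalize(bank_emb, dim=1)
    sim = q @ k.t() / tau                                  # (B, M) similarities
    pos = labels.unsqueeze(1) == bank_labels.unsqueeze(0)  # same-class positives
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)
    # pull toward all same-class entries, push from the rest
    loss = -(log_prob * pos).sum(1) / pos.sum(1).clamp(min=1)
    return loss.mean()
```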

NeurIPS 2023 · Conference Paper

HAP: Structure-Aware Masked Image Modeling for Human-Centric Perception

  • Junkun Yuan
  • Xinyu Zhang
  • Hao Zhou
  • Jian Wang
  • Zhongwei Qiu
  • Zhiyin Shao
  • Shaofeng Zhang
  • Sifan Long

Model pre-training is essential in human-centric perception. In this paper, we first introduce masked image modeling (MIM) as a pre-training approach for this task. Upon revisiting the MIM training strategy, we reveal that human structure priors offer significant potential. Motivated by this insight, we further incorporate an intuitive human structure prior - human parts - into pre-training. Specifically, we employ this prior to guide the mask sampling process. Image patches, corresponding to human part regions, have high priority to be masked out. This encourages the model to concentrate more on body structure information during pre-training, yielding substantial benefits across a range of human-centric perception tasks. To further capture human characteristics, we propose a structure-invariant alignment loss that enforces different masked views, guided by the human part prior, to be closely aligned for the same image. We term the entire method as HAP. HAP simply uses a plain ViT as the encoder yet establishes new state-of-the-art performance on 11 human-centric benchmarks, and an on-par result on one dataset. For example, HAP achieves 78.1% mAP on MSMT17 for person re-identification, 86.54% mA on PA-100K for pedestrian attribute recognition, 78.2% AP on MS COCO for 2D pose estimation, and 56.0 PA-MPJPE on 3DPW for 3D pose and shape estimation.
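
The part-guided mask sampling can be pictured as a two-step draw: spend most of the masking budget on patches covered by human parts, then fill the remainder randomly. The 75% mask ratio and 80% part fraction below are placeholders, not the paper's settings.

```python
import torch

def part_guided_mask(num_patches, part_patch_ids, mask_ratio=0.75, part_frac=0.8):
    # part_patch_ids: 1-D LongTensor of patch indices covered by human parts.
    # Returns a bool mask of shape (num_patches,), True = masked out.
    num_mask = int(num_patches * mask_ratio)
    num_part = min(int(num_mask * part_frac), part_patch_ids.numel())

    perm = part_patch_ids[torch.randperm(part_patch_ids.numel())]
    chosen = perm[:num_part]                      # high-priority part patches

    remaining = torch.ones(num_patches, dtype=torch.bool)
    remaining[chosen] = False
    candidates = remaining.nonzero(as_tuple=True)[0]
    extra = candidates[torch.randperm(candidates.numel())][: num_mask - num_part]

    mask = torch.zeros(num_patches, dtype=torch.bool)
    mask[chosen] = True
    mask[extra] = True
    return mask
```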

AAAI 2023 · Conference Paper

Robust Video Portrait Reenactment via Personalized Representation Quantization

  • Kaisiyuan Wang
  • Changcheng Liang
  • Hang Zhou
  • Jiaxiang Tang
  • Qianyi Wu
  • Dongliang He
  • Zhibin Hong
  • Jingtuo Liu

While progress has been made in the field of portrait reenactment, the problem of how to produce high-fidelity and robust videos remains. Recent studies normally find it challenging to handle rarely seen target poses due to the limitation of source data. This paper proposes the Video Portrait via Non-local Quantization Modeling (VPNQ) framework, which produces pose- and disturbance-robust reenactable video portraits. Our key insight is to learn position-invariant quantized local patch representations and build a mapping between simple driving signals and local textures with non-local spatial-temporal modeling. Specifically, instead of learning a universal quantized codebook, we identify that a personalized one can be trained to preserve desired position-invariant local details better. Then, a simple representation of projected landmarks can be used as sufficient driving signals to avoid 3D rendering. Next, we employ a carefully designed Spatio-Temporal Transformer to predict reasonable and temporally consistent quantized tokens from the driving signal. The predicted codes can be decoded back to robust and high-quality videos. Comprehensive experiments have been conducted to validate the effectiveness of our approach.

AAAI 2023 · Conference Paper

StereoDistill: Pick the Cream from LiDAR for Distilling Stereo-Based 3D Object Detection

  • Zhe Liu
  • Xiaoqing Ye
  • Xiao Tan
  • Errui Ding
  • Xiang Bai

In this paper, we propose a cross-modal distillation method named StereoDistill to narrow the gap between stereo- and LiDAR-based approaches by distilling the stereo detectors from the superior LiDAR model at the response level, which is usually overlooked in 3D object detection distillation. The key designs of StereoDistill are: the X-component Guided Distillation (XGD) for regression and the Cross-anchor Logit Distillation (CLD) for classification. In XGD, instead of empirically adopting a threshold to select the high-quality teacher predictions as soft targets, we decompose the predicted 3D box into sub-components and retain the corresponding part for distillation if the teacher component pilot is consistent with the ground truth, which largely boosts the number of positive predictions and alleviates the mimicking difficulty of the student model. For CLD, we aggregate the probability distribution of all anchors at the same position to encourage the highest-probability anchor, rather than individually distilling the distribution at the anchor level. Finally, our StereoDistill achieves state-of-the-art results for stereo-based 3D detection on the KITTI test benchmark, and extensive experiments on the KITTI and Argoverse datasets validate its effectiveness.
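
A sketch of the cross-anchor aggregation in CLD: treat the logits of all anchors at one position as a single distribution and match student to teacher. Softmax-over-anchors and KL divergence are assumed choices, not necessarily the paper's exact ones.

```python
import torch.nn.functional as F

def cross_anchor_logit_distillation(student_logits, teacher_logits, tau=1.0):
    # logits: (B, P, A) — A anchors per spatial position P; the distribution
    # over anchors at each position is distilled as a whole, encouraging the
    # highest-probability anchor instead of matching each anchor separately.
    t = F.softmax(teacher_logits / tau, dim=-1)
    log_s = F.log_softmax(student_logits / tau, dim=-1)
    return F.kl_div(log_s, t, reduction="batchmean") * tau ** 2
```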

ICLR 2023 · Conference Paper

StrucTexTv2: Masked Visual-Textual Prediction for Document Image Pre-training

  • Yuechen Yu
  • Yulin Li
  • Chengquan Zhang
  • Xiaoqiang Zhang 0006
  • Zengyuan Guo
  • Xiameng Qin
  • Kun Yao
  • Junyu Han

In this paper, we present StrucTexTv2, an effective document image pre-training framework, by performing masked visual-textual prediction. It consists of two self-supervised pre-training tasks: masked image modeling and masked language modeling, based on text region-level image masking. The proposed method randomly masks some image regions according to the bounding box coordinates of text words. The objectives of our pre-training tasks are reconstructing the pixels of masked image regions and the corresponding masked tokens simultaneously. Hence the pre-trained encoder can capture more textual semantics in comparison to the masked image modeling that usually predicts the masked image patches. Compared to the masked multi-modal modeling methods for document image understanding that rely on both the image and text modalities, StrucTexTv2 models image-only input and potentially deals with more application scenarios free from OCR pre-processing. Extensive experiments on mainstream benchmarks of document image understanding demonstrate the effectiveness of StrucTexTv2. It achieves competitive or even new state-of-the-art performance in various downstream tasks such as image classification, layout analysis, table structure recognition, document OCR, and information extraction under the end-to-end scenario.
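
The text region-level masking can be sketched as blanking out a random subset of OCR word boxes before encoding; the 30% ratio and zero fill are illustrative assumptions.

```python
import torch

def mask_word_regions(image, word_boxes, mask_ratio=0.3, fill=0.0):
    # image: (C, H, W) float tensor; word_boxes: list of (x0, y0, x1, y1).
    # The model is then trained to reconstruct both the pixels and the word
    # tokens inside the masked regions.
    n = len(word_boxes)
    k = max(1, int(n * mask_ratio))
    picked = torch.randperm(n)[:k].tolist()
    masked = image.clone()
    for i in picked:
        x0, y0, x1, y1 = word_boxes[i]
        masked[:, y0:y1, x0:x1] = fill
    return masked, picked
```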

NeurIPS 2022 · Conference Paper

Delving into Sequential Patches for Deepfake Detection

  • Jiazhi Guan
  • Hang Zhou
  • Zhibin Hong
  • Errui Ding
  • Jingdong Wang
  • Chengbin Quan
  • Youjian Zhao

Recent advances in face forgery techniques produce nearly visually untraceable deepfake videos, which could be leveraged with malicious intentions. As a result, researchers have been devoted to deepfake detection. Previous studies have identified the importance of local low-level cues and temporal information in the pursuit of generalizing well across deepfake methods; however, they still suffer from robustness problems against post-processing. In this work, we propose the Local- & Temporal-aware Transformer-based Deepfake Detection (LTTD) framework, which adopts a local-to-global learning protocol with a particular focus on the valuable temporal information within local sequences. Specifically, we propose a Local Sequence Transformer (LST), which models the temporal consistency on sequences of restricted spatial regions, where low-level information is hierarchically enhanced with shallow layers of learned 3D filters. Based on the local temporal embeddings, we then achieve the final classification in a global contrastive way. Extensive experiments on popular datasets validate that our approach effectively spots local forgery cues and achieves state-of-the-art performance.

AAAI 2022 · Conference Paper

MobileFaceSwap: A Lightweight Framework for Video Face Swapping

  • Zhiliang Xu
  • Zhibin Hong
  • Changxing Ding
  • Zhen Zhu
  • Junyu Han
  • Jingtuo Liu
  • Errui Ding

Advanced face swapping methods have achieved appealing results. However, most of these methods have many parameters and computations, which makes it challenging to apply them in real-time applications or deploy them on edge devices like mobile phones. In this work, we propose a lightweight Identity-aware Dynamic Network (IDN) for subject-agnostic face swapping by dynamically adjusting the model parameters according to the identity information. In particular, we design an efficient Identity Injection Module (IIM) by introducing two dynamic neural network techniques, namely weights prediction and weights modulation. Once the IDN is updated, it can be applied to swap faces given any target image or video. The presented IDN contains only 0.50M parameters and needs 0.33G FLOPs per frame, making it capable of real-time video face swapping on mobile phones. In addition, we introduce a knowledge distillation-based method for stable training, and a loss reweighting module is employed to obtain better synthesized results. Finally, our method achieves comparable results with the teacher models and other state-of-the-art methods.
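
Of the two dynamic techniques, weights prediction is the easier to sketch: a small head maps the source identity embedding to per-sample convolution kernels, applied with the standard grouped-convolution trick. Layer shapes are illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IdentityPredictedConv(nn.Module):
    def __init__(self, channels, id_dim, k=3):
        super().__init__()
        self.channels, self.k = channels, k
        self.predict = nn.Linear(id_dim, channels * k * k)  # kernel predictor

    def forward(self, feat, id_emb):
        # feat: (B, C, H, W); id_emb: (B, id_dim) identity embedding
        b, c, h, w = feat.shape
        kernels = self.predict(id_emb).view(b * c, 1, self.k, self.k)
        # fold batch into channels so each sample gets its own depthwise kernel
        out = F.conv2d(feat.reshape(1, b * c, h, w), kernels,
                       padding=self.k // 2, groups=b * c)
        return out.view(b, c, h, w)
```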

NeurIPS 2022 · Conference Paper

RTFormer: Efficient Design for Real-Time Semantic Segmentation with Transformer

  • Jian Wang
  • Chenhui Gou
  • Qiman Wu
  • Haocheng Feng
  • Junyu Han
  • Errui Ding
  • Jingdong Wang

Recently, transformer-based networks have shown impressive results in semantic segmentation. Yet for real-time semantic segmentation, pure CNN-based approaches still dominate the field, due to the time-consuming computation mechanism of transformers. We propose RTFormer, an efficient dual-resolution transformer for real-time semantic segmentation, which achieves a better trade-off between performance and efficiency than CNN-based models. To achieve high inference efficiency on GPU-like devices, our RTFormer leverages GPU-Friendly Attention with linear complexity and discards the multi-head mechanism. Besides, we find that cross-resolution attention is more efficient at gathering global context information for the high-resolution branch by spreading the high-level knowledge learned in the low-resolution branch. Extensive experiments on mainstream benchmarks demonstrate the effectiveness of our proposed RTFormer: it achieves state-of-the-art performance on Cityscapes, CamVid and COCOStuff, and shows promising results on ADE20K.
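
A single-head attention with linear complexity can be sketched in the external-attention style — learnable key/value tokens plus double normalization; treating this as the flavor of the paper's GPU-Friendly Attention is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearSingleHeadAttention(nn.Module):
    def __init__(self, dim, num_tokens=64):
        super().__init__()
        # learnable external keys/values replace pairwise token attention
        self.keys = nn.Parameter(torch.randn(num_tokens, dim) * dim ** -0.5)
        self.values = nn.Parameter(torch.randn(num_tokens, dim) * dim ** -0.5)

    def forward(self, x):              # x: (B, N, C), N = H * W pixels
        attn = x @ self.keys.t()       # (B, N, M): cost linear in N
        attn = F.softmax(attn, dim=-1)
        attn = attn / (attn.sum(dim=1, keepdim=True) + 1e-6)  # double normalization
        return attn @ self.values      # (B, N, C)
```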

IJCAI 2022 · Conference Paper

Self-Guided Hard Negative Generation for Unsupervised Person Re-Identification

  • Dongdong Li
  • Zhigang Wang
  • Jian Wang
  • Xinyu Zhang
  • Errui Ding
  • Jingdong Wang
  • Zhaoxiang Zhang

Recent unsupervised person re-identification (reID) methods mostly apply pseudo labels from clustering algorithms as supervision signals. Despite great success, this fashion is very likely to aggregate different identities with similar appearances into the same cluster. As a result, the hard negative samples, which play an important role in training reID models, are significantly reduced. To alleviate this problem, we propose a self-guided hard negative generation method for unsupervised person re-ID. Specifically, a joint framework is developed which incorporates a hard negative generation network (HNGN) and a re-ID network. To continuously generate harder negative samples that provide effective supervision in contrastive learning, the two networks are alternately trained in an adversarial manner to improve each other, where the re-ID network guides HNGN to generate challenging data and HNGN enforces the re-ID network to enhance its discrimination ability. During inference, the performance of the re-ID network is improved without introducing any extra parameters. Extensive experiments demonstrate that the proposed method significantly outperforms a strong baseline and also achieves better results than state-of-the-art methods.

NeurIPS 2022 · Conference Paper

Singular Value Fine-tuning: Few-shot Segmentation requires Few-parameters Fine-tuning

  • Yanpeng Sun
  • Qiang Chen
  • Xiangyu He
  • Jian Wang
  • Haocheng Feng
  • Junyu Han
  • Errui Ding
  • Jian Cheng

Freezing the pre-trained backbone has become a standard paradigm to avoid overfitting in few-shot segmentation. In this paper, we rethink the paradigm and explore a new regime: fine-tuning a small part of the parameters in the backbone. We present a solution to overcome the overfitting problem, leading to better model generalization on learning novel classes. Our method decomposes backbone parameters into three successive matrices via the Singular Value Decomposition (SVD), then only fine-tunes the singular values and keeps the others frozen. The above design allows the model to adjust feature representations on novel classes while maintaining semantic clues within the pre-trained backbone. We evaluate our Singular Value Fine-tuning (SVF) approach on various few-shot segmentation methods with different backbones. We achieve state-of-the-art results on both Pascal-5ⁱ and COCO-20ⁱ across 1-shot and 5-shot settings. Hopefully, this simple baseline will encourage researchers to rethink the role of backbone fine-tuning in few-shot settings.
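
The core trick is compact enough to sketch directly: decompose a pre-trained weight with SVD, freeze the singular vectors, and expose only the singular values to the optimizer. Shown here for a linear layer; the paper applies the idea to backbone convolutions.

```python
import torch
import torch.nn as nn

class SVFLinear(nn.Module):
    def __init__(self, pretrained_weight: torch.Tensor):
        super().__init__()
        u, s, vh = torch.linalg.svd(pretrained_weight, full_matrices=False)
        self.register_buffer("u", u)    # frozen singular vectors
        self.register_buffer("vh", vh)  # frozen singular vectors
        self.s = nn.Parameter(s)        # the only trainable parameters

    def forward(self, x):
        w = self.u @ torch.diag(self.s) @ self.vh  # W = U diag(S) V^T
        return x @ w.t()
```

For a d×d weight this leaves only d trainable values per layer, in line with the few-parameters fine-tuning regime the abstract describes.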

NeurIPS 2021 · Conference Paper

Dual-stream Network for Visual Recognition

  • Mingyuan Mao
  • Peng Gao
  • Renrui Zhang
  • Honghui Zheng
  • Teli Ma
  • Yan Peng
  • Errui Ding
  • Baochang Zhang

Transformers with remarkable global representation capacities achieve competitive results for visual tasks, but fail to consider high-level local pattern information in input images. In this paper, we present a generic Dual-stream Network (DS-Net) to fully explore the representation capacity of local and global pattern features for image classification. Our DS-Net can simultaneously calculate fine-grained and integrated features and efficiently fuse them. Specifically, we propose an Intra-scale Propagation module to process two different resolutions in each block and an Inter-Scale Alignment module to perform information interaction across features at dual scales. Besides, we also design a Dual-stream FPN (DS-FPN) to further enhance contextual information for downstream dense predictions. Without bells and whistles, the proposed DS-Net outperforms DeiT-Small by 2.4% in terms of top-1 accuracy on ImageNet-1k and achieves state-of-the-art performance over other Vision Transformers and ResNets. For object detection and instance segmentation, DS-Net-Small respectively outperforms ResNet-50 by 6.4% and 5.5% in terms of mAP on MSCOCO 2017, and surpasses the previous state-of-the-art scheme, which significantly demonstrates its potential to be a general backbone in vision tasks. The code will be released soon.

AAAI 2021 · Conference Paper

FaceController: Controllable Attribute Editing for Face in the Wild

  • Zhiliang Xu
  • Xiyu Yu
  • Zhibin Hong
  • Zhen Zhu
  • Junyu Han
  • Jingtuo Liu
  • Errui Ding
  • Xiang Bai

Face attribute editing aims to generate faces with one or multiple desired face attributes manipulated while other details are preserved. Unlike prior works such as GAN inversion, which has an expensive reverse mapping process, we propose a simple feed-forward network to generate high-fidelity manipulated faces. By simply employing some existing and easily obtainable prior information, our method can control, transfer, and edit diverse attributes of faces in the wild. The proposed method can consequently be applied to various applications such as face swapping, face relighting, and makeup transfer. In our method, we decouple identity, expression, pose, and illumination using 3D priors, and separate texture and colors using region-wise style codes. All the information is embedded into adversarial learning by our identity-style normalization module. Disentanglement losses are proposed to enhance the generator's ability to extract information independently from each attribute. Comprehensive quantitative and qualitative evaluations have been conducted. In a single framework, our method achieves the best or competitive scores on a variety of face applications.

AAAI 2021 · Conference Paper

MVFNet: Multi-View Fusion Network for Efficient Video Recognition

  • Wenhao Wu
  • Dongliang He
  • Tianwei Lin
  • Fu Li
  • Chuang Gan
  • Errui Ding

Conventionally, the spatiotemporal modeling network and its complexity are the two most concentrated research topics in video action recognition. Existing state-of-the-art methods have achieved excellent accuracy regardless of complexity, while efficient spatiotemporal modeling solutions are slightly inferior in performance. In this paper, we attempt to acquire both efficiency and effectiveness simultaneously. First of all, besides traditionally treating H × W × T video frames as a space-time signal (viewing from the Height-Width spatial plane), we propose to also model video from the other two Height-Time and Width-Time planes, to capture the dynamics of video thoroughly. Secondly, our model is designed based on 2D CNN backbones, with model complexity well kept in mind by design. Specifically, we introduce a novel multi-view fusion (MVF) module to exploit video dynamics using separable convolution for efficiency. It is a plug-and-play module and can be inserted into off-the-shelf 2D CNNs to form a simple yet effective model called MVFNet. Moreover, MVFNet can be thought of as a generalized video modeling framework, and it can specialize into existing methods such as C2D, SlowOnly, and TSM under different settings. Extensive experiments are conducted on popular benchmarks (i.e., Something-Something V1 & V2, Kinetics, UCF-101, and HMDB-51) to show its superiority. The proposed MVFNet can achieve state-of-the-art performance while maintaining a 2D CNN’s complexity.
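
The multi-view idea — convolving over the Height-Width, Height-Time, and Width-Time planes — can be sketched as three separable depthwise 3D convolutions fused by summation; the kernel sizes and residual fusion are assumptions about the exact MVF design.

```python
import torch.nn as nn

class MultiViewFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        dw = dict(groups=channels, bias=False)  # depthwise: one kernel per channel
        # kernel sizes are (T, H, W): each conv sees exactly one plane pair
        self.hw = nn.Conv3d(channels, channels, (1, 3, 3), padding=(0, 1, 1), **dw)
        self.ht = nn.Conv3d(channels, channels, (3, 3, 1), padding=(1, 1, 0), **dw)
        self.wt = nn.Conv3d(channels, channels, (3, 1, 3), padding=(1, 0, 1), **dw)

    def forward(self, x):  # x: (B, C, T, H, W)
        return x + self.hw(x) + self.ht(x) + self.wt(x)
```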

AAAI 2021 · Conference Paper

PGNet: Real-time Arbitrarily-Shaped Text Spotting with Point Gathering Network

  • Pengfei Wang
  • Chengquan Zhang
  • Fei Qi
  • Shanshan Liu
  • Xiaoqiang Zhang
  • Pengyuan Lyu
  • Junyu Han
  • Jingtuo Liu

The reading of arbitrarily-shaped text has received increasing research attention. However, existing text spotters are mostly built on two-stage frameworks or character-based methods, which suffer from either Non-Maximum Suppression (NMS), Region-of-Interest (RoI) operations, or character-level annotations. In this paper, to address the above problems, we propose a novel fully convolutional Point Gathering Network (PGNet) for reading arbitrarily-shaped text in real-time. PGNet is a single-shot text spotter, where the pixel-level character classification map is learned with the proposed PG-CTC loss, avoiding the use of character-level annotations. With a PG-CTC decoder, we gather high-level character classification vectors from two-dimensional space and decode them into text symbols without NMS and RoI operations involved, which guarantees high efficiency. Additionally, reasoning about the relations between each character and its neighbors, a graph refinement module (GRM) is proposed to optimize the coarse recognition and improve the end-to-end performance. Experiments prove that the proposed method achieves competitive accuracy while significantly improving the running speed. In particular, on Total-Text it runs at 46.7 FPS, surpassing the previous spotters by a large margin.

IJCAI 2021 · Conference Paper

Weakly-Supervised Spatio-Temporal Anomaly Detection in Surveillance Video

  • Jie Wu
  • Wei Zhang
  • Guanbin Li
  • Wenhao Wu
  • Xiao Tan
  • Yingying Li
  • Errui Ding
  • Liang Lin

In this paper, we introduce a novel task, referred to as Weakly-Supervised Spatio-Temporal Anomaly Detection (WSSTAD) in surveillance video. Specifically, given an untrimmed video, WSSTAD aims to localize a spatio-temporal tube (i.e., a sequence of bounding boxes at consecutive times) that encloses the abnormal event, with only coarse video-level annotations as supervision during training. To address this challenging task, we propose a dual-branch network which takes as input the proposals with multi-granularities in both spatial-temporal domains. Each branch employs a relationship reasoning module to capture the correlation between tubes/videolets, which can provide rich contextual information and complex entity relationships for the concept learning of abnormal behaviors. A Mutually-guided Progressive Refinement framework is set up to employ dual-path mutual guidance in a recurrent manner, iteratively sharing auxiliary supervision information across branches. It impels the learned concepts of each branch to serve as a guide for its counterpart, which progressively refines the corresponding branch and the whole framework. Furthermore, we contribute two datasets, i.e., ST-UCF-Crime and STRA, consisting of videos containing spatio-temporal abnormal annotations, to serve as benchmarks for WSSTAD. We conduct extensive qualitative and quantitative evaluations to demonstrate the effectiveness of the proposed approach and analyze the key factors that contribute most to handling this task.

NeurIPS 2020 · Conference Paper

Discriminative Sounding Objects Localization via Self-supervised Audiovisual Matching

  • Di Hu
  • Rui Qian
  • Minyue Jiang
  • Xiao Tan
  • Shilei Wen
  • Errui Ding
  • Weiyao Lin
  • Dejing Dou

Discriminatively localizing sounding objects in cocktail-party, i.e., mixed sound scenes, is commonplace for humans, but still challenging for machines. In this paper, we propose a two-stage learning framework to perform self-supervised class-aware sounding object localization. First, we propose to learn robust object representations by aggregating the candidate sound localization results in single-source scenes. Then, class-aware object localization maps are generated in the cocktail-party scenarios by referring to the pre-learned object knowledge, and the sounding objects are accordingly selected by matching audio and visual object category distributions, where the audiovisual consistency is viewed as the self-supervised signal. Experimental results in both realistic and synthesized cocktail-party videos demonstrate that our model is superior in filtering out silent objects and pointing out the locations of sounding objects of different classes. Code is available at https://github.com/DTaoo/Discriminative-Sounding-Objects-Localization.

AAAI 2020 · Conference Paper

Dynamic Instance Normalization for Arbitrary Style Transfer

  • Yongcheng Jing
  • Xiao Liu
  • Yukang Ding
  • Xinchao Wang
  • Errui Ding
  • Mingli Song
  • Shilei Wen

Prior normalization methods rely on affine transformations to produce arbitrary image style transfers, of which the parameters are computed in a pre-defined way. Such a manually defined nature eventually results in high-cost and shared encoders for both style and content encoding, making style transfer systems cumbersome to deploy in resource-constrained environments like the mobile-terminal side. In this paper, we propose a new and generalized normalization module, termed Dynamic Instance Normalization (DIN), that allows for flexible and more efficient arbitrary style transfers. Comprising an instance normalization and a dynamic convolution, DIN encodes a style image into learnable convolution parameters, upon which the content image is stylized. Unlike conventional methods that use shared complex encoders to encode content and style, the proposed DIN introduces a sophisticated style encoder, yet comes with a compact and lightweight content encoder for fast inference. Experimental results demonstrate that the proposed approach yields very encouraging results on challenging style patterns and, to the best of our knowledge, for the first time enables arbitrary style transfer using a MobileNet-based lightweight architecture, leading to a reduction factor of more than twenty in computational cost as compared to existing approaches. Furthermore, the proposed DIN provides flexible support for state-of-the-art convolutional operations, and thus triggers novel functionalities, such as uniform-stroke placement for non-natural images and automatic spatial-stroke control.
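
A minimal rendering of the module as described — instance normalization followed by a convolution whose weights come from the style image; the 1×1 dynamic kernel and the tiny linear predictor are illustrative choices, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicInstanceNorm(nn.Module):
    def __init__(self, channels, style_dim):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        self.to_weight = nn.Linear(style_dim, channels * channels)  # 1x1 kernels
        self.to_bias = nn.Linear(style_dim, channels)

    def forward(self, content, style_code):
        # content: (B, C, H, W); style_code: (B, style_dim) from a style encoder
        b, c, h, w = content.shape
        x = self.norm(content)
        w_dyn = self.to_weight(style_code).view(b * c, c, 1, 1)
        b_dyn = self.to_bias(style_code).reshape(b * c)
        # grouped-conv trick: one predicted 1x1 convolution per sample
        out = F.conv2d(x.reshape(1, b * c, h, w), w_dyn, b_dyn, groups=b)
        return out.view(b, c, h, w)
```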

AAAI 2020 · Conference Paper

ZoomNet: Part-Aware Adaptive Zooming Neural Network for 3D Object Detection

  • Zhenbo Xu
  • Wei Zhang
  • Xiaoqing Ye
  • Xiao Tan
  • Wei Yang
  • Shilei Wen
  • Errui Ding
  • Ajin Meng

3D object detection is an essential task in autonomous driving and robotics. Though great progress has been made, challenges remain in estimating 3D pose for distant and occluded objects. In this paper, we present a novel framework named ZoomNet for stereo imagery-based 3D detection. The pipeline of ZoomNet begins with an ordinary 2D object detection model which is used to obtain pairs of left-right bounding boxes. To further exploit the abundant texture cues in RGB images for more accurate disparity estimation, we introduce a conceptually straightforward module – adaptive zooming – which simultaneously resizes 2D instance bounding boxes to a unified resolution and adjusts the camera intrinsic parameters accordingly. In this way, we are able to estimate higher-quality disparity maps from the resized box images and then construct dense point clouds for both nearby and distant objects. Moreover, we propose learning part locations as complementary features to improve resistance against occlusion, and put forward a 3D fitting score to better estimate the 3D detection quality. Extensive experiments on the popular KITTI 3D detection dataset indicate that ZoomNet surpasses all previous state-of-the-art methods by large margins (improved by 9.4% on APbv (IoU=0.7) over pseudo-LiDAR). An ablation study also demonstrates that our adaptive zooming strategy brings an improvement of over 10% on AP3d (IoU=0.7). In addition, since the official KITTI benchmark lacks fine-grained annotations like pixel-wise part locations, we also present our KFG dataset by augmenting KITTI with detailed instance-wise annotations, including pixel-wise part location, pixel-wise disparity, etc. Both the KFG dataset and our code will be publicly available at https://github.com/detectRecog/ZoomNet.
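
The intrinsics bookkeeping behind adaptive zooming is standard pinhole-camera algebra: cropping a box and resizing it to a unified resolution shifts the principal point and scales the focal lengths. A sketch of that arithmetic (not code from the paper):

```python
def zoomed_intrinsics(fx, fy, cx, cy, box, target_size):
    # box: (x0, y0, x1, y1) in pixels; target_size: (width, height) after resize
    x0, y0, x1, y1 = box
    tw, th = target_size
    sx, sy = tw / (x1 - x0), th / (y1 - y0)  # zoom factors per axis
    return (fx * sx,          # fx'
            fy * sy,          # fy'
            (cx - x0) * sx,   # cx': shift into crop coordinates, then scale
            (cy - y0) * sy)   # cy'
```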

NeurIPS 2018 · Conference Paper

Compact Generalized Non-local Network

  • Kaiyu Yue
  • Ming Sun
  • Yuchen Yuan
  • Feng Zhou
  • Errui Ding
  • Fuxin Xu

The non-local module is designed for capturing long-range spatio-temporal dependencies in images and videos. Although having shown excellent performance, it lacks the mechanism to model the interactions between positions across channels, which are of vital importance in recognizing fine-grained objects and actions. To address this limitation, we generalize the non-local module and take the correlations between the positions of any two channels into account. This extension utilizes a compact representation for multiple kernel functions with Taylor expansion, which allows the generalized non-local module to run in a fast, low-complexity computation flow. Moreover, we implement our generalized non-local method within channel groups to ease the optimization. Experimental results illustrate the clear-cut improvements and practical applicability of the generalized non-local module on both fine-grained object recognition and video classification. Code is available at: https://github.com/KaiyuYue/cgnl-network.pytorch.
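
The low-complexity flow is easiest to see with a plain dot-product kernel, where associativity replaces the N×N pairwise affinity with a C×C channel-correlation matrix — the very object that lets the module model interactions across channels. Higher-order Taylor terms for other kernels are omitted here.

```python
import torch

def linearized_generalized_nonlocal(theta, phi, g):
    # theta, phi, g: (B, N, C), N = flattened spatial(-temporal) positions.
    # theta @ (phi^T @ g) costs O(N * C^2) instead of O(N^2 * C).
    kernel = phi.transpose(1, 2) @ g          # (B, C, C) channel correlations
    return (theta @ kernel) / theta.shape[1]  # normalize by #positions

y = linearized_generalized_nonlocal(
    torch.randn(2, 196, 64), torch.randn(2, 196, 64), torch.randn(2, 196, 64))
```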

AAAI 2017 · Conference Paper

Localizing by Describing: Attribute-Guided Attention Localization for Fine-Grained Recognition

  • Xiao Liu
  • Jiang Wang
  • Shilei Wen
  • Errui Ding
  • Yuanqing Lin

A key challenge in fine-grained recognition is how to find and represent discriminative local regions. Recent attention models are capable of learning discriminative region localizers only from category labels with reinforcement learning. However, without utilizing any explicit part information, they are not able to accurately find multiple distinctive regions. In this work, we introduce an attribute-guided attention localization scheme where the local region localizers are learned under the guidance of part attribute descriptions. By designing a novel reward strategy, we are able to learn to locate regions that are spatially and semantically distinctive with a reinforcement learning algorithm. The scheme's attribute labeling requirement is easier to satisfy than the accurate part-location annotations required by traditional part-based fine-grained recognition methods. Experimental results on the CUB-200-2011 dataset (Wah et al. 2011) demonstrate the superiority of the proposed scheme on both fine-grained recognition and attribute recognition.