Arrow Research search

Author name cluster

Zilei Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

26 papers
2 author rows

Possible papers

26

AAAI Conference 2026 Conference Paper

PosPrune: Visual Token Pruning with Positional Bias Correction for Efficient Large Vision-Language Models

  • Ziyang Wang
  • Mengwei Li
  • Hao Yin
  • Wenhao Liu
  • Zilei Wang

Large Vision-Language Models (LVLMs) enhance performance on vision-language tasks by integrating visual features from pre-trained vision encoders into large language models (LLMs). However, the large number of visual tokens introduces significant computational overhead. Existing token pruning methods either perform global selection via [CLS]-based attention in the vision encoder or prune within LLM decoding layers. These approaches face two key challenges: (1) [CLS]-based attention primarily focuses on visually salient regions across the entire image, often overlooking semantically important tokens essential for reasoning; and (2) strong positional bias in the shallow decoder layers causes the model to favor later-positioned tokens, while neglecting earlier ones that may carry critical reasoning cues. To address these issues, we propose PosPrune, a training-free, two-stage visual token pruning framework. At the vision encoder, we introduce an Asymmetric Region-aware Pruning (ARP) strategy that retains more tokens in semantically rich regions while discarding more tokens from semantically less informative regions, thus preserving spatial diversity and task-relevant details. In the LLM decoding stage, we find that the positional bias in shallow layers is primarily driven by model architecture rather than task semantics. Based on this insight, we propose a novel Positional Bias Correction (PBC) mechanism to mitigate this bias. To further reduce redundancy, we apply Maximal Marginal Relevance (MMR) to select tokens that best balance textual relevance and diversity. Extensive experiments on various LVLMs and benchmarks demonstrate the general effectiveness of our approach. Notably, when applied to LLaVA-1.5-7B, PosPrune achieves a reduction of 85% in FLOPs while preserving 98.5% of the original performance.
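As a rough illustration of the MMR selection step mentioned in the abstract, the sketch below greedily picks token indices that trade off relevance to the text query against redundancy with tokens already kept. The λ weighting and scoring details are generic MMR conventions, not taken from the paper.

```python
import numpy as np

def mmr_select(relevance, tokens, k, lam=0.7):
    """Greedy Maximal Marginal Relevance: pick k token indices that
    balance relevance (to the text query) against similarity to
    already-selected tokens. `tokens` are L2-normalized feature rows."""
    selected = []
    candidates = list(range(len(tokens)))
    sims = tokens @ tokens.T  # pairwise cosine similarity
    while len(selected) < k and candidates:
        if not selected:
            scores = {i: relevance[i] for i in candidates}
        else:
            scores = {i: lam * relevance[i]
                         - (1 - lam) * max(sims[i, j] for j in selected)
                      for i in candidates}
        best = max(scores, key=scores.get)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With λ = 0.5 the selector skips a near-duplicate of an already-chosen token even when its relevance is high, which is exactly the redundancy-reduction behavior the abstract describes.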

AAAI Conference 2026 Conference Paper

Rethinking Open-world Prompt Tuning: A Systematic Framework for Evaluation and Optimization

  • Mengwei Li
  • Zilei Wang
  • Yixin Zhang

Prompt Tuning (PT) is a widely used strategy for adapting pre-trained Vision-Language Models (VLMs) to various downstream tasks. Conventional PT methods evaluate performance separately on known (base) and unknown (new) classes. However, in real-world scenarios, models often encounter inputs without prior knowledge of their class domain. This challenge has motivated the development of Open-world Prompt Tuning (OPT), which requires models to first determine whether a sample belongs to base or new classes and then classify it accordingly. In this work, we carefully review existing OPT methods and identify three key limitations: (L1) incomplete evaluation metrics, (L2) time-consuming and memory-intensive OOD detection methods, and (L3) insufficiently comprehensive optimization strategies. To address these issues, we first tackle L1 by proposing two novel metrics to explicitly evaluate adaptability and generalization under the OPT setting, forming a more comprehensive evaluation framework. For L2, we propose a training-free OOD detection method called Entropy-weighted Rank-normalized Fusion (ERF), which first applies rank normalization to both the maximum and the sum of base-class probabilities, followed by an entropy-weighted fusion of the normalized values. For L3, we propose a plug-and-play Gated Dual-Merging (GDM) strategy to strengthen the classifier’s capability. GDM performs selective merging at the weight level based on an adaptive criterion and combines fine-tuned and LLM-boosted logits at the output level. Extensive experiments on three PT baselines across 11 datasets demonstrate the effectiveness of our proposed ERF and GDM.
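The ERF detector described above can be sketched roughly as follows. The abstract specifies rank normalization of the max and the sum of base-class probabilities plus an entropy-weighted fusion, but not the exact fusion rule; the entropy-based interpolation between the two rank-normalized scores here is an assumption.

```python
import numpy as np

def erf_score(probs):
    """Sketch of Entropy-weighted Rank-normalized Fusion (ERF) for
    base-vs-new detection. `probs` is an (N, C) array of base-class
    probabilities per sample; higher scores suggest base-class samples.
    The fusion weighting is an illustrative assumption."""
    n, c = probs.shape
    s_max = probs.max(axis=1)                  # peak base-class probability
    s_sum = probs.sum(axis=1)                  # total base-class mass
    # rank normalization: map each score to its rank in [0, 1]
    r_max = s_max.argsort().argsort() / (n - 1)
    r_sum = s_sum.argsort().argsort() / (n - 1)
    # normalized entropy of the renormalized base-class prediction
    pn = probs / probs.sum(axis=1, keepdims=True)
    ent = -(pn * np.log(pn + 1e-12)).sum(axis=1) / np.log(c)
    # confident (low-entropy) samples lean on the max, diffuse ones on the sum
    return (1 - ent) * r_max + ent * r_sum
```

Because both ingredients are rank-based and training-free, the score costs one pass over the test probabilities, matching the abstract's emphasis on avoiding time-consuming OOD detection.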

AAAI Conference 2026 Conference Paper

Training-free Boosting for Few-shot Segmentation via Generalizing Semantic Mining

  • Kangyu Xiao
  • Zilei Wang
  • Yixin Zhang
  • Junjie Li

Few-shot Semantic Segmentation (FSS) aims to segment novel target objects with the guidance of minimal annotated reference examples. Affinity-based methods have great advantages in the FSS inference stage for both specialist models and foundation models. However, current affinity calculation relies only on support-query matching, without considering query-specific semantics or the semantic correlation among support samples, which limits the representation ability of the affinity map. In this paper, we propose Generalizing Semantic Mining (GSM), which exploits generalizing semantics to improve the affinity calculation. Concretely, we first organize affinity-based inference into three main steps to reveal the crucial role of the affinity map. To address the low-data problem, a Target Semantic Reusing module treats the query sample as a proxy reference and assigns it a proxy mask identifying its most generalizing semantic regions. Then, to generate a high-fidelity proxy mask, a Query-specific Semantic Modeling module pinpoints the most generalizing regions through prior semantic analysis. Finally, a Representative Re-weighting module explicitly modulates the affinity calculation via generalization-aware weighting. Experiments on FSS benchmarks demonstrate that GSM serves as a plug-and-play free lunch for both specialist models and foundation models.

AAAI Conference 2025 Conference Paper

Exploring Vacant Classes in Label-Skewed Federated Learning

  • Kuangpu Guo
  • Yuhe Ding
  • Jian Liang
  • Zilei Wang
  • Ran He
  • Tieniu Tan

Label skews, characterized by disparities in local label distribution across clients, pose a significant challenge in federated learning. As minority classes suffer from worse accuracy due to overfitting on local imbalanced data, prior methods often incorporate class-balanced learning techniques during local training. Although these methods improve the mean accuracy across all classes, we observe that vacant classes—referring to categories absent from a client's data distribution—remain poorly recognized. Besides, there is still a gap in the accuracy of local models on minority classes compared to the global model. This paper introduces FedVLS, a novel approach to label-skewed federated learning that integrates both vacant-class distillation and logit suppression simultaneously. Specifically, vacant-class distillation leverages knowledge distillation during local training on each client to retain essential information related to vacant classes from the global model. Moreover, logit suppression directly penalizes network logits for non-label classes, effectively addressing misclassifications in minority classes that may be biased toward majority classes. Extensive experiments validate the efficacy of FedVLS, demonstrating superior performance compared to previous state-of-the-art (SOTA) methods across diverse datasets with varying degrees of label skews.

AAAI Conference 2025 Conference Paper

GCD: Advancing Vision-Language Models for Incremental Object Detection via Global Alignment and Correspondence Distillation

  • Xu Wang
  • Zilei Wang
  • Zihan Lin

Incremental object detection (IOD) is a challenging task that requires detection models to continuously learn from newly arriving data. This work focuses on incremental learning for vision-language detectors (VLDs), an underexplored domain. Existing research typically adopts a local alignment paradigm to avoid label conflicts, where different tasks are learned separately without interaction. However, we reveal that this practice fails to effectively preserve the semantic structure. Specifically, aligned relationships between objects and texts collapse when handling novel categories, ultimately leading to catastrophic forgetting. Though knowledge distillation (KD) is a common approach for tackling this, traditional KD performs poorly when directly applied to VLDs, as a natural knowledge gap exists between phases in both the encoding and decoding processes. To address the above issues, we propose a novel method called Global alignment and Correspondence Distillation (GCD). In contrast to local alignment, we first integrate knowledge across phases within the same embedding space to construct a global semantic structure. We then enable effective knowledge distillation in VLDs through a semantic correspondence mechanism, ensuring consistent proposal generation and decoding. On top of that, we distill the teacher model's informative predictions and topological relationships to maintain a stable local semantic structure. Extensive experiments on COCO 2017 demonstrate that our method significantly outperforms existing approaches, achieving new state-of-the-art results in various IOD scenarios.

ICLR Conference 2025 Conference Paper

LoRA-Pro: Are Low-Rank Adapters Properly Optimized?

  • Zhengbo Wang
  • Jian Liang 0001
  • Ran He 0001
  • Zilei Wang
  • Tieniu Tan

Low-rank adaptation, also known as LoRA, has emerged as a prominent method for parameter-efficient fine-tuning of foundation models. Despite its computational efficiency, LoRA still yields inferior performance compared to full fine-tuning. In this paper, we first uncover a fundamental connection between the optimization processes of LoRA and full fine-tuning: using LoRA for optimization is mathematically equivalent to full fine-tuning using a low-rank gradient for parameter updates. This low-rank gradient can be expressed in terms of the gradients of the two low-rank matrices in LoRA. Leveraging this insight, we introduce LoRA-Pro, a method that enhances LoRA's performance by strategically adjusting the gradients of these low-rank matrices. This adjustment allows the low-rank gradient to more accurately approximate the full fine-tuning gradient, thereby narrowing the performance gap between LoRA and full fine-tuning. Furthermore, we theoretically derive the optimal solutions for adjusting the gradients of the low-rank matrices, applying them during fine-tuning in LoRA-Pro. We conduct extensive experiments across natural language understanding, dialogue generation, mathematical reasoning, code generation, and image classification tasks, demonstrating that LoRA-Pro substantially improves LoRA's performance, effectively narrowing the gap with full fine-tuning. Our code is publicly available at https://github.com/mrflogs/LoRA-Pro.
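The claimed equivalence is easy to check numerically. Writing the adapted weight as W = W0 + sBA (the usual LoRA convention, assumed here), one SGD step on A and B changes W, to first order in the learning rate, by exactly the step full fine-tuning would take with the low-rank gradient s²(g AᵀA + B Bᵀ g), where g is the gradient of the loss with respect to W:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r, s, lr = 6, 5, 2, 2.0, 0.1   # dims, LoRA scaling, learning rate

B = rng.standard_normal((m, r))
A = rng.standard_normal((r, n))
g = rng.standard_normal((m, n))      # grad of loss w.r.t. W = W0 + s*B@A

# chain rule: the gradients SGD actually applies to the LoRA factors
gB = s * g @ A.T
gA = s * B.T @ g

# first-order change in W after one SGD step on A and B
delta_W = s * ((-lr * gB) @ A + B @ (-lr * gA))

# the equivalent low-rank gradient on W implied by that step
g_lowrank = s**2 * (g @ A.T @ A + B @ B.T @ g)

assert np.allclose(delta_W, -lr * g_lowrank)
```

The two factors AᵀA and BBᵀ are what LoRA-Pro's gradient adjustment targets, so that this implied low-rank gradient better matches g itself.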

AAAI Conference 2025 Conference Paper

Protecting Model Adaptation from Trojans in the Unlabeled Data

  • Lijun Sheng
  • Jian Liang
  • Ran He
  • Zilei Wang
  • Tieniu Tan

Model adaptation tackles the distribution shift problem with a pre-trained model instead of raw data, which has become a popular paradigm due to its great privacy protection. Existing methods typically assume a clean target domain, overlooking the security risks of unlabeled samples. This paper is the first to explore potential trojan attacks on model adaptation launched by well-designed poisoning of target data. Concretely, we provide two trigger patterns with two poisoning strategies for different levels of prior knowledge owned by attackers. These attacks achieve a high success rate while maintaining normal performance on clean samples at test time. To defend against such backdoor injection, we propose a plug-and-play method named DiffAdapt, which can be seamlessly integrated with existing adaptation algorithms. Experiments across commonly used benchmarks and adaptation methods demonstrate the effectiveness of DiffAdapt. We hope this work will shed light on the safety of transfer learning with unlabeled data.

AAAI Conference 2025 Conference Paper

Target Semantics Clustering via Text Representations for Robust Universal Domain Adaptation

  • Weinan He
  • Zilei Wang
  • Yixin Zhang

Universal Domain Adaptation (UniDA) focuses on transferring source domain knowledge to the target domain under both domain shift and unknown category shift. Its main challenge lies in identifying common class samples and aligning them. Current methods typically obtain target-domain semantic centers from an unconstrained continuous image representation space. Due to domain shift and the unknown number of clusters, these centers often result in complex and less robust alignment algorithms. In this paper, based on vision-language models, we search for semantic centers in a semantically meaningful and discrete text representation space. The constrained space ensures almost no domain bias and appropriate semantic granularity for these centers, enabling a simple and robust adaptation algorithm. Specifically, we propose TArget Semantics Clustering (TASC) via Text Representations, which leverages information maximization as a unified objective and involves two stages. First, with the frozen encoders, a greedy search-based framework is used to search for an optimal set of text embeddings to represent target semantics. Second, with the search results fixed, encoders are refined based on gradient descent, simultaneously achieving robust domain alignment and private class clustering. Additionally, we propose Universal Maximum Similarity (UniMS), a scoring function tailored for detecting open-set samples in UniDA. Experimentally, we evaluate the universality of UniDA algorithms under four category shift scenarios. Extensive experiments on four benchmarks demonstrate the effectiveness and robustness of our method, which has achieved state-of-the-art performance.

NeurIPS Conference 2025 Conference Paper

The Illusion of Progress? A Critical Look at Test-Time Adaptation for Vision-Language Models

  • Lijun Sheng
  • Jian Liang
  • Ran He
  • Zilei Wang
  • Tieniu Tan

Test-time adaptation (TTA) methods have gained significant attention for enhancing the performance of vision-language models (VLMs) such as CLIP during inference, without requiring additional labeled data. However, current TTA research generally suffers from major limitations such as duplication of baseline results, limited evaluation metrics, inconsistent experimental settings, and insufficient analysis. These problems hinder fair comparisons between TTA methods and make it difficult to assess their practical strengths and weaknesses. To address these challenges, we introduce TTA-VLM, a comprehensive benchmark for evaluating TTA methods on VLMs. Our benchmark implements 8 episodic TTA and 7 online TTA methods within a unified and reproducible framework, and evaluates them across 15 widely used datasets. Unlike prior studies focused solely on CLIP, we extend the evaluation to SigLIP—a model trained with a Sigmoid loss—and include training-time tuning methods such as CoOp, MaPLe, and TeCoA to assess generality. Beyond classification accuracy, TTA-VLM incorporates various evaluation metrics, including robustness, calibration, out-of-distribution detection, and stability, enabling a more holistic assessment of TTA methods. Through extensive experiments, we find that 1) existing TTA methods produce limited gains compared to the previous pioneering work; 2) current TTA methods exhibit poor collaboration with training-time fine-tuning methods; 3) accuracy gains frequently come at the cost of reduced model trustworthiness. We release TTA-VLM to provide fair comparison and comprehensive evaluation of TTA methods for VLMs, and we hope it encourages the community to develop more reliable and generalizable TTA strategies. The code is available at https://github.com/TomSheng21/tta-vlm.

NeurIPS Conference 2025 Conference Paper

The Mirage of Performance Gains: Why Contrastive Decoding Fails to Mitigate Object Hallucinations in MLLMs?

  • Hao Yin
  • Guangzong Si
  • Zilei Wang

Contrastive decoding strategies are widely used to reduce object hallucinations in multimodal large language models (MLLMs). These methods work by constructing contrastive samples to induce hallucinations and then suppressing them in the output distribution. However, this paper demonstrates that such approaches fail to effectively mitigate the hallucination problem. The performance improvements observed on POPE Benchmark are largely driven by two misleading factors: (1) crude, unidirectional adjustments to the model’s output distribution and (2) the adaptive plausibility constraint, which reduces the sampling strategy to greedy search. To further illustrate these issues, we introduce a series of spurious improvement methods and evaluate their performance against contrastive decoding techniques. Experimental results reveal that the observed performance gains in contrastive decoding are entirely unrelated to its intended goal of mitigating hallucinations. Our findings challenge common assumptions about the effectiveness of contrastive decoding strategies and pave the way for developing genuinely effective solutions to hallucinations in MLLMs.
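For readers unfamiliar with the family of methods being critiqued, a generic contrastive-decoding step looks roughly like this; α, β, and the exact constraint form vary across published variants, so this is a sketch of the pattern, not any specific method.

```python
import numpy as np

def contrastive_decode(logits, logits_hallu, alpha=1.0, beta=0.1):
    """One generic contrastive-decoding step: amplify the gap between the
    original logits and the hallucination-induced logits, restricted to
    tokens passing the adaptive plausibility constraint (probability at
    least beta times the original distribution's max)."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    keep = p >= beta * p.max()                      # adaptive plausibility set
    scores = (1 + alpha) * logits - alpha * logits_hallu
    scores = np.where(keep, scores, -np.inf)        # mask implausible tokens
    out = np.exp(scores - scores.max())
    return out / out.sum()
```

Note that as β approaches 1 the plausibility set shrinks to the argmax token(s) of the original distribution, which is precisely the degeneration of sampling into greedy search that the abstract identifies as a misleading source of gains.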

AAAI Conference 2024 Conference Paper

A Dynamic Learning Method towards Realistic Compositional Zero-Shot Learning

  • Xiaoming Hu
  • Zilei Wang

Compositional Zero-Shot Learning (CZSL) methods have been proposed to tackle the challenge of recognizing images of unseen attribute-object compositions. However, test images in realistic scenarios may also incorporate other forms of unknown factors, such as novel semantic concepts or novel image styles. As previous CZSL works have overlooked this critical issue, in this work we first propose the Realistic Compositional Zero-Shot Learning (RCZSL) task, which considers the various types of unknown factors in a unified experimental setting. To achieve this, we first re-label MIT-States and use pre-trained generative models to obtain images of various domains. The entire dataset is then split into a training set and a test set, with the latter containing images of unseen concepts, unseen compositions, unseen domains, and their combinations. Following this, we show that the visual-semantic relationship changes on unseen images, leading us to construct two dynamic modulators that adapt the visual features and composition prototypes to the input image. We believe that such a dynamic learning method can effectively alleviate the domain shift problem caused by various types of unknown factors. We conduct extensive experiments on benchmark datasets for both the conventional CZSL setting and the proposed RCZSL setting. Empirical results show that our method significantly outperforms both our baseline method and state-of-the-art approaches.

ICLR Conference 2024 Conference Paper

A Hard-to-Beat Baseline for Training-free CLIP-based Adaptation

  • Zhengbo Wang
  • Jian Liang 0001
  • Lijun Sheng
  • Ran He 0001
  • Zilei Wang
  • Tieniu Tan

Contrastive Language-Image Pretraining (CLIP) has gained popularity for its remarkable zero-shot capacity. Recent research has focused on developing efficient fine-tuning methods, such as prompt learning and adapter, to enhance CLIP's performance in downstream tasks. However, these methods still require additional training time and computational resources, which is undesirable for devices with limited resources. In this paper, we revisit a classical algorithm, Gaussian Discriminant Analysis (GDA), and apply it to the downstream classification of CLIP. Typically, GDA assumes that features of each class follow Gaussian distributions with identical covariance. By leveraging Bayes' formula, the classifier can be expressed in terms of the class means and covariance, which can be estimated from the data without the need for training. To integrate knowledge from both visual and textual modalities, we ensemble it with the original zero-shot classifier within CLIP. Extensive results on 17 datasets validate that our method surpasses or achieves comparable results with state-of-the-art methods on few-shot classification, imbalanced learning, and out-of-distribution generalization. In addition, we extend our method to base-to-new generalization and unsupervised learning, once again demonstrating its superiority over competing approaches. Our code is publicly available at https://github.com/mrflogs/ICLR24.
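The training-free classifier described above amounts to a few lines of linear algebra: class means plus one shared covariance give the Bayes-optimal linear head. A minimal sketch (uniform class priors and a small shrinkage term are assumptions; the paper's estimator details and the ensemble with the zero-shot classifier are omitted):

```python
import numpy as np

def fit_gda(feats, labels, n_classes):
    """Training-free GDA head: estimate class means and one shared
    covariance from (few-shot) features, then return the weights and
    bias of the Bayes classifier w_k = S^-1 mu_k,
    b_k = -0.5 * mu_k^T S^-1 mu_k (uniform priors assumed)."""
    d = feats.shape[1]
    mus = np.stack([feats[labels == k].mean(axis=0) for k in range(n_classes)])
    centered = feats - mus[labels]
    cov = centered.T @ centered / len(feats) + 1e-4 * np.eye(d)  # shrinkage
    inv = np.linalg.inv(cov)
    W = mus @ inv                                   # (n_classes, d)
    b = -0.5 * np.einsum('kd,kd->k', W, mus)
    return W, b

def gda_logits(feats, W, b):
    return feats @ W.T + b
```

Since estimation needs no gradient steps, fitting on new few-shot data is a single pass, which is the resource advantage the abstract emphasizes over prompt learning and adapters.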

ICML Conference 2024 Conference Paper

Connecting the Dots: Collaborative Fine-tuning for Black-Box Vision-Language Models

  • Zhengbo Wang
  • Jian Liang 0001
  • Ran He 0001
  • Zilei Wang
  • Tieniu Tan

With the emergence of pretrained vision-language models (VLMs), considerable efforts have been devoted to fine-tuning them for downstream tasks. Despite the progress made in designing efficient fine-tuning methods, such methods require access to the model's parameters, which can be challenging as model owners often opt to provide their models as a black box to safeguard model ownership. This paper proposes a Collaborative Fine-Tuning (CraFT) approach for fine-tuning black-box VLMs to downstream tasks, where one only has access to the input prompts and the output predictions of the model. CraFT comprises two modules, a prompt generation module for learning text prompts and a prediction refinement module for enhancing output predictions in residual style. Additionally, we introduce an auxiliary prediction-consistent loss to promote consistent optimization across these modules. These modules are optimized by a novel collaborative training algorithm. Extensive experiments on few-shot classification over 15 datasets demonstrate the superiority of CraFT. The results show that CraFT achieves a decent gain of about 12% with 16-shot datasets and only 8,000 queries. Moreover, CraFT trains faster and uses only about 1/80 of the memory footprint for deployment, while sacrificing only 1.62% compared to the white-box method. Our code is publicly available at https://github.com/mrflogs/CraFT.

IJCAI Conference 2024 Conference Paper

Probabilistic Contrastive Learning for Domain Adaptation

  • Junjie Li
  • Yixin Zhang
  • Zilei Wang
  • Saihui Hou
  • Keyu Tu
  • Man Zhang

Contrastive learning has shown impressive success in enhancing feature discriminability for various visual tasks in a self-supervised manner, but the standard contrastive paradigm (features + l2 normalization) has limited benefits when applied in domain adaptation. We find that this is mainly because the class weights (weights of the final fully connected layer) are ignored in the domain adaptation optimization process, which makes it difficult for features to cluster around the corresponding class weights. To solve this problem, we propose the simple but powerful Probabilistic Contrastive Learning (PCL), which moves beyond the standard paradigm by removing l2 normalization and replacing the features with probabilities. PCL can guide the probability distribution towards a one-hot configuration, thus minimizing the discrepancy between features and class weights. We conduct extensive experiments to validate the effectiveness of PCL and observe consistent performance gains on five tasks, i.e., Unsupervised/Semi-Supervised Domain Adaptation (UDA/SSDA), Semi-Supervised Learning (SSL), UDA Detection, and Semantic Segmentation. Notably, for UDA Semantic Segmentation on SYNTHIA, PCL surpasses the sophisticated CPSL-D by 2% in terms of mean IoU with a much lower training cost (PCL: 1×3090, 5 days vs. CPSL-D: 4×V100, 11 days). Code is available at https://github.com/ljjcoder/Probabilistic-Contrastive-Learning.
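A minimal sketch of the paradigm shift PCL describes: the standard InfoNCE objective, but with softmax probabilities in place of l2-normalized features. The temperature and the two-view pairing scheme here are illustrative, not the paper's exact training setup.

```python
import numpy as np

def softmax(z, axis=1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def pcl_loss(logits_q, logits_k, temperature=0.1):
    """InfoNCE over class probabilities: rows i of the two views form
    positive pairs (diagonal), all other rows are negatives. Replacing
    l2-normalized features with probabilities pushes predictions toward
    one-hot, aligning features with the class weights."""
    p_q = softmax(logits_q)                  # (N, C) probabilities, view 1
    p_k = softmax(logits_k)                  # (N, C) probabilities, view 2
    sim = p_q @ p_k.T / temperature          # (N, N); positives on diagonal
    log_p = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_p))
```

The loss is minimized when each sample's probability vector is one-hot and matched with its own second view, which is the "one-hot configuration" behavior the abstract highlights.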

AAAI Conference 2023 Conference Paper

Actionness Inconsistency-Guided Contrastive Learning for Weakly-Supervised Temporal Action Localization

  • Zhilin Li
  • Zilei Wang
  • Qinying Liu

Weakly-supervised temporal action localization (WTAL) aims to detect action instances given only video-level labels. To address the challenge, recent methods commonly employ a two-branch framework, consisting of a class-aware branch and a class-agnostic branch. In principle, the two branches are supposed to produce the same actionness activation. However, we observe that there are actually many inconsistent activation regions. These inconsistent regions usually contain some challenging segments whose semantic information (action or background) is ambiguous. In this work, we propose a novel Actionness Inconsistency-guided Contrastive Learning (AICL) method which utilizes the consistent segments to boost the representation learning of the inconsistent segments. Specifically, we first define the consistent and inconsistent segments by comparing the predictions of two branches and then construct positive and negative pairs between consistent segments and inconsistent segments for contrastive learning. In addition, to avoid the trivial case where there is no consistent sample, we introduce an action consistency constraint to control the difference between the two branches. We conduct extensive experiments on THUMOS14, ActivityNet v1.2, and ActivityNet v1.3 datasets, and the results show the effectiveness of AICL with state-of-the-art performance. Our code is available at https://github.com/lizhilin-ustc/AAAI2023-AICL.

AAAI Conference 2023 Conference Paper

Exploit Domain-Robust Optical Flow in Domain Adaptive Video Semantic Segmentation

  • Yuan Gao
  • Zilei Wang
  • Jiafan Zhuang
  • Yixin Zhang
  • Junjie Li

Domain adaptive semantic segmentation aims to exploit the pixel-level annotated samples on the source domain to assist the segmentation of unlabeled samples on the target domain. For such a task, the key is to construct reliable supervision signals on the target domain. However, existing methods can only provide unreliable supervision signals constructed by a segmentation model (SegNet) that is generally domain-sensitive. In this work, we try to find a domain-robust clue to construct more reliable supervision signals. In particular, we experimentally observe the domain-robustness of optical flow in video tasks, as it mainly represents the motion characteristics of scenes. However, optical flow cannot be directly used as a supervision signal for semantic segmentation since the two essentially represent different information. To tackle this issue, we first propose a novel Segmentation-to-Flow Module (SFM) that converts semantic segmentation maps to optical flows, named segmentation-based flow (SF), and then propose a Segmentation-based Flow Consistency (SFC) method to impose consistency between SF and optical flow, which can implicitly supervise the training of the segmentation model. The extensive experiments on two challenging benchmarks demonstrate the effectiveness of our method, and it outperforms previous state-of-the-art methods with considerable performance improvement. Our code is available at https://github.com/EdenHazardan/SFC.

AAAI Conference 2023 Conference Paper

Leveraging Sub-class Discrimination for Compositional Zero-Shot Learning

  • Xiaoming Hu
  • Zilei Wang

Compositional Zero-Shot Learning (CZSL) aims at identifying unseen compositions composed of previously seen attributes and objects during the test phase. In real images, the visual appearances of attributes and objects (primitive concepts) generally interact with each other. Namely, the visual appearance of an attribute may change when composed with different objects, and vice versa. But previous works overlook this important property. In this paper, we introduce a simple yet effective approach that leverages sub-class discrimination. Specifically, we define the primitive concepts in different compositions as sub-classes, and then maintain the sub-class discrimination to address the above challenge. More specifically, inspired by the observation that the composed recognition models could account for the differences across sub-classes, we first propose to impose embedding alignment between the composed and disentangled recognition to incorporate sub-class discrimination at the feature level. Then we develop prototype modulator networks to adjust the class prototypes w.r.t. the composition information, which can enhance sub-class discrimination at the classifier level. We conduct extensive experiments on challenging benchmark datasets and achieve considerable performance improvements over state-of-the-art approaches, which indicates the effectiveness of our method. Our code is available at https://github.com/hxm97/SCD-CZSL.

ICML Conference 2022 Conference Paper

Disentangled Federated Learning for Tackling Attributes Skew via Invariant Aggregation and Diversity Transferring

  • Zhengquan Luo
  • Yunlong Wang 0003
  • Zilei Wang
  • Zhenan Sun
  • Tieniu Tan

Attributes skew hinders the current federated learning (FL) frameworks from consistent optimization directions among the clients, which inevitably leads to performance reduction and unstable convergence. The core problems lie in that: 1) Domain-specific attributes, which are non-causal and only locally valid, are inadvertently mixed into global aggregation. 2) The one-stage optimizations of entangled attributes cannot simultaneously satisfy two conflicting objectives, i.e., generalization and personalization. To cope with these problems, we propose disentangled federated learning (DFL) to disentangle the domain-specific and cross-invariant attributes into two complementary branches, which are trained independently by the proposed alternating local-global optimization. Importantly, convergence analysis proves that the FL system can be stably converged even if incomplete client models participate in the global aggregation, which greatly expands the application scope of FL. Extensive experiments verify that DFL facilitates FL with higher performance, better interpretability, and a faster convergence rate, compared with SOTA FL methods on both manually synthesized and realistic attributes skew datasets.

AAAI Conference 2021 Conference Paper

Efficient License Plate Recognition via Holistic Position Attention

  • Yesheng Zhang
  • Zilei Wang
  • Jiafan Zhuang

License plate recognition (LPR) is a fundamental component of various intelligent transportation systems, and is always expected to be accurate and efficient enough in real-world applications. Nowadays, recognition of a single character has become sophisticated benefiting from the power of deep learning, and extracting position information for forming a character sequence has become the main bottleneck of LPR. To tackle this issue, we propose a novel holistic position attention (HPA) in this paper that consists of a position network and a shared classifier. Specifically, the position network explicitly encodes the character position into the maps of HPA, and then the shared classifier performs character recognition in a unified and parallel way. Here the extracted features are modulated by the attention maps before being fed into the classifier to yield the final recognition results. Note that our proposed method is end-to-end trainable, character recognition can be performed concurrently, and no post-processing is needed. Thus our LPR system can achieve good effectiveness and efficiency simultaneously. The experimental results on four public datasets, including AOLP, Media Lab, CCPD, and CLPD, well demonstrate the superiority of our method over previous state-of-the-art methods in both accuracy and speed.

IJCAI Conference 2021 Conference Paper

Few-Shot Learning with Part Discovery and Augmentation from Unlabeled Images

  • Wentao Chen
  • Chenyang Si
  • Wei Wang
  • Liang Wang
  • Zilei Wang
  • Tieniu Tan

Few-shot learning is a challenging task since only few instances are given for recognizing an unseen class. One way to alleviate this problem is to acquire a strong inductive bias via meta-learning on similar tasks. In this paper, we show that such inductive bias can be learned from a flat collection of unlabeled images, and instantiated as transferable representations among seen and unseen classes. Specifically, we propose a novel part-based self-supervised representation learning scheme to learn transferable representations by maximizing the similarity of an image to its discriminative part. To mitigate the overfitting in few-shot classification caused by data scarcity, we further propose a part augmentation strategy by retrieving extra images from a base dataset. We conduct systematic studies on miniImageNet and tieredImageNet benchmarks. Remarkably, our method yields impressive results, outperforming the previous best unsupervised methods by 7.74% and 9.24% under 5-way 1-shot and 5-way 5-shot settings, which are comparable with state-of-the-art supervised methods.

AAAI Conference 2021 Conference Paper

Learning Intact Features by Erasing-Inpainting for Few-shot Classification

  • Junjie Li
  • Zilei Wang
  • Xiaoming Hu

Few-shot classification aims to categorize samples from unseen classes with only a few labeled samples. To address this challenge, many methods exploit a base set consisting of massive labeled samples to learn an instance embedding function, i.e., an image feature extractor, which is expected to possess good transferability across different tasks. Such characteristics of few-shot learning are essentially different from those of traditional image classification, which only pursues discriminative image representations. In this paper, we propose to learn intact features by erasing-inpainting for few-shot classification. Specifically, we argue that extracting intact features of target objects is more transferable, and we then propose a novel cross-set erasing-inpainting (CSEI) method. CSEI processes the images in the support set using erasing and inpainting, and then uses them to augment the query set of the same task. Consequently, the feature embedding produced by our proposed method can contain more complete information about target objects. In addition, we propose task-specific feature modulation to make the features adaptive to the current task. Extensive experiments on two widely used benchmarks demonstrate the effectiveness of our proposed method, which consistently yields considerable performance gains for different baseline methods.
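
The erasing half of the erasing-inpainting augmentation can be sketched in a few lines. This is a minimal illustration only: the paper pairs erasing with an inpainting step, which is omitted here and stood in for by a constant fill; the image, patch location, and size are all assumed:

```python
import numpy as np

rng = np.random.default_rng(0)

def erase(image, top, left, size, fill=0.0):
    """Overwrite a square patch with a constant value -- the 'erasing'
    half of CSEI. The inpainting half (filling the patch with plausible
    content) is omitted in this sketch."""
    out = image.copy()
    out[top:top + size, left:left + size] = fill
    return out

support_img = rng.random((32, 32))            # hypothetical 32x32 support image
augmented = erase(support_img, top=8, left=8, size=8)
print(augmented[8:16, 8:16].sum())            # the patch is fully erased
```

Augmenting the query set with such processed support images forces the embedding to rely on the whole object rather than only its most discriminative patch, which is the intuition behind "intact features".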

AAAI Conference 2020 Conference Paper

Joint Adversarial Learning for Domain Adaptation in Semantic Segmentation

  • Yixin Zhang
  • Zilei Wang

Unsupervised domain adaptation in semantic segmentation exploits the pixel-level annotated samples in the source domain to aid the segmentation of unlabeled samples in the target domain. For such a task, the key is to learn domain-invariant representations, and adversarial learning is usually used, in which the discriminator distinguishes which domain the input comes from and the segmentation model aims to deceive the domain discriminator. In this work, we first propose a novel joint adversarial learning (JAL) scheme to boost the domain discriminator in the output space by introducing information from the domain discriminator on low-level features. Consequently, the training of the high-level decoder is enhanced. We then propose a weight transfer module (WTM) to alleviate the inherent bias of the trained decoder towards the source domain. Specifically, WTM converts the original decoder into a new decoder, which is learned only under the supervision of the adversarial loss and thus mainly focuses on reducing domain divergence. Extensive experiments on two widely used benchmarks show that our method brings considerable performance improvements over different baseline methods, which demonstrates the effectiveness of our method for output-space adaptation.
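
The adversarial setup described above (a domain discriminator versus a segmentation model trying to fool it) boils down to two opposing binary cross-entropy losses. A minimal numeric sketch, with the discriminator outputs as assumed scalar probabilities rather than real network predictions:

```python
import numpy as np

def bce(pred, label):
    """Binary cross-entropy on a sigmoid output."""
    eps = 1e-7
    pred = np.clip(pred, eps, 1 - eps)
    return float(-(label * np.log(pred) + (1 - label) * np.log(1 - pred)))

# Hypothetical discriminator outputs on segmentation maps:
# convention 1 = source domain, 0 = target domain.
d_on_source, d_on_target = 0.9, 0.2

# Discriminator objective: tell the two domains apart.
loss_d = bce(d_on_source, 1.0) + bce(d_on_target, 0.0)

# Segmentation-model adversarial objective: make target-domain outputs
# look like source, i.e. push D(target) toward the source label.
loss_adv = bce(d_on_target, 1.0)
print(round(loss_d, 3), round(loss_adv, 3))
```

Minimizing `loss_adv` through the segmentation model while minimizing `loss_d` through the discriminator is the standard min-max game; JAL's contribution is to strengthen the output-space discriminator with low-level feature information, which this sketch does not model.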

AAAI Conference 2020 Conference Paper

Progressive Boundary Refinement Network for Temporal Action Detection

  • Qinying Liu
  • Zilei Wang

Temporal action detection is a challenging task due to the vagueness of action boundaries. To tackle this issue, we propose an end-to-end progressive boundary refinement network (PBRNet) in this paper. PBRNet belongs to the family of one-stage detectors and is equipped with three cascaded detection modules that localize action boundaries with increasing precision. Specifically, PBRNet mainly consists of coarse pyramidal detection, refined pyramidal detection, and fine-grained detection. The first two modules build two feature pyramids to perform anchor-based detection, and the third explores frame-level features to refine the boundaries of each action instance. In the fine-grained detection module, three frame-level classification branches are proposed to augment the frame-level features and update the confidence scores of action instances. Evidently, PBRNet integrates the anchor-based and frame-level methods. We experimentally evaluate the proposed PBRNet and comprehensively investigate the effect of its main components. The results show that PBRNet achieves state-of-the-art detection performance on two popular benchmarks, THUMOS'14 and ActivityNet, while maintaining a high inference speed.
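
The cascaded refinement idea reduces to each stage predicting boundary offsets that correct the previous stage's proposal. A toy sketch with made-up numbers (in a real detector the offsets would be regressed by each module, not hard-coded):

```python
# Progressive refinement: each stage predicts (start, end) offsets that
# shrink the boundary error of the previous stage's temporal proposal.
def refine(interval, offsets):
    start, end = interval
    d_start, d_end = offsets
    return (start + d_start, end + d_end)

proposal = (10.0, 30.0)                       # coarse anchor-based detection
# Offsets from the refined-pyramidal and fine-grained stages (assumed values).
for offsets in [(1.5, -2.0), (0.3, 0.4)]:
    proposal = refine(proposal, offsets)
print(proposal)  # (11.8, 28.4)
```

The design point is that the later stages only need to model small residual corrections, which is easier than regressing exact boundaries in one shot.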

IJCAI Conference 2019 Conference Paper

Densely Supervised Hierarchical Policy-Value Network for Image Paragraph Generation

  • Siying Wu
  • Zheng-Jun Zha
  • Zilei Wang
  • Houqiang Li
  • Feng Wu

Image paragraph generation aims to describe an image with a paragraph in natural language. Compared to image captioning with a single sentence, paragraph generation provides a more expressive and fine-grained description for storytelling. Existing approaches mainly optimize the paragraph generator by minimizing a word-wise cross-entropy loss, which neglects the linguistic hierarchy of a paragraph and results in "sparse" supervision for generator learning. In this paper, we propose a novel Densely Supervised Hierarchical Policy-Value (DHPV) network for effective paragraph generation. We design new hierarchical supervisions consisting of hierarchical rewards and values at both the sentence and word levels. The joint exploration of hierarchical rewards and values provides dense supervision cues for learning an effective paragraph generator. We propose a new hierarchical policy-value architecture that exploits compositionality at the token-to-token and sentence-to-sentence levels simultaneously and can preserve semantic and syntactic constituent integrity. Extensive experiments on the Stanford image-paragraph benchmark demonstrate the effectiveness of the proposed DHPV approach, with performance improvements over multiple state-of-the-art methods.

AAAI Conference 2019 Conference Paper

Weighted Channel Dropout for Regularization of Deep Convolutional Neural Network

  • Saihui Hou
  • Zilei Wang

In this work, we propose a novel method named Weighted Channel Dropout (WCD) for the regularization of deep Convolutional Neural Networks (CNNs). Unlike Dropout, which randomly selects neurons to set to zero in the fully-connected layers, WCD operates on the channels in the stack of convolutional layers. Specifically, WCD consists of two steps, i.e., Rating Channels and Selecting Channels, and three modules, i.e., Global Average Pooling, Weighted Random Selection, and Random Number Generator. It filters the channels according to their activation status and can be plugged in between any two consecutive layers, which unifies the original Dropout and Channel-Wise Dropout. WCD is totally parameter-free and is deployed only in the training phase with very slight computational cost. The network at test time remains unchanged, so no inference cost is added. Moreover, when combined with existing networks, it requires no re-pretraining on ImageNet and is thus well-suited for application to small datasets. Finally, WCD with VGGNet-16, ResNet-101, and Inception-V3 is experimentally evaluated on multiple datasets. The extensive results demonstrate that WCD brings consistent improvements over the baselines.
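
The two steps named above (Rating Channels via global average pooling, then Selecting Channels via weighted random selection) can be sketched directly. The keep ratio, tensor shape, and exact selection scheme are illustrative assumptions, not the paper's precise procedure:

```python
import numpy as np

rng = np.random.default_rng(0)

def weighted_channel_dropout(x, keep_ratio=0.5, rng=rng):
    """Sketch of WCD on a (C, H, W) activation map, training-time only.

    Step 1 (Rating Channels): score each channel by its global
    average pooling (GAP) response.
    Step 2 (Selecting Channels): keep channels with probability
    proportional to their score (weighted random selection),
    zeroing out the rest.
    """
    scores = x.mean(axis=(1, 2))                 # GAP per channel
    probs = scores / scores.sum()                # selection weights
    c = x.shape[0]
    keep = rng.choice(c, size=int(c * keep_ratio), replace=False, p=probs)
    mask = np.zeros(c)
    mask[keep] = 1.0
    return x * mask[:, None, None]

acts = rng.random((8, 4, 4)) + 0.1               # positive post-ReLU-like activations
out = weighted_channel_dropout(acts)
print(int((out.sum(axis=(1, 2)) > 0).sum()))     # 4 channels survive
```

Because weakly activated channels are more likely to be dropped, the surviving channels carry most of the signal, which is the intended difference from uniform channel-wise dropout.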

AAAI Conference 2018 Conference Paper

Lateral Inhibition-Inspired Convolutional Neural Network for Visual Attention and Saliency Detection

  • Chunshui Cao
  • Yongzhen Huang
  • Zilei Wang
  • Liang Wang
  • Ninglong Xu
  • Tieniu Tan

Lateral inhibition in top-down feedback is widespread in visual neurobiology, but such an important mechanism has not been well explored in computer vision. In our recent research, we find that modeling lateral inhibition in a convolutional neural network (LICNN) is very useful for visual attention and saliency detection. In this paper, we propose to formulate lateral inhibition inspired by the related studies in neurobiology and embed it into the top-down gradient computation of a general CNN for classification, i.e., only category-level information is used. After this operation (conducted only once), the network gains the ability to generate accurate category-specific attention maps. Further, we apply LICNN to weakly-supervised salient object detection. Extensive experimental studies on a set of databases, e.g., ECSSD, HKU-IS, PASCAL-S, and DUT-OMRON, demonstrate the great advantage of LICNN, which achieves state-of-the-art performance. It is especially impressive that LICNN, with only category-level supervision, even outperforms some recent methods trained with segmentation-level supervision.
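
The lateral-inhibition idea (each unit's response suppressed by its neighbors' activity) can be illustrated in one dimension. This is a loose stand-in, not the paper's formulation, which operates on top-down gradient maps inside a CNN; the inhibition strength `alpha` and the 1-D neighborhood are assumptions:

```python
import numpy as np

def lateral_inhibit(g, alpha=0.5):
    """Suppress each unit by a fraction of its immediate neighbors'
    responses, clamping at zero (a 1-D toy version of the mechanism)."""
    left = np.roll(g, 1)
    left[0] = 0.0                     # no wraparound at the borders
    right = np.roll(g, -1)
    right[-1] = 0.0
    return np.maximum(g - alpha * (left + right), 0.0)

grad = np.array([0.1, 0.9, 1.0, 0.8, 0.1])
print(lateral_inhibit(grad))
```

The effect is that weak responses adjacent to strong ones are driven to zero while local peaks survive, sharpening the map, which is the property that yields cleaner category-specific attention.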