Arrow Research search

Author name cluster

Yanyun Qu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

33 papers
2 author rows

Possible papers (33)

AAAI Conference 2026 Conference Paper

BeyondSparse: Facilitating Mamba to Enhance Cross-Domain 3D Semantic Segmentation in Adverse Weather

  • Yao Wu
  • Mingwei Xing
  • Yachao Zhang
  • Fangyong Wang
  • Xiaopei Zhang
  • Yanyun Qu

Domain generalization (DG) and domain adaptation (DA) for 3D semantic segmentation enable the model to maintain high performance while avoiding labor-intensive and time-consuming annotation of target-domain data. However, under adverse weather conditions, the injection of spatial noise affects the reflectivity of LiDAR point clouds, exacerbates domain distribution discrepancies, and degrades the generalization ability of the model. Current methods mainly rely on sparse convolution-based architectures. Due to their limited receptive field, these models capture varying local geometric information when dealing with point clouds of different sparsities, thereby limiting their transferability. To this end, we propose BeyondSparse, a novel cross-domain 3D semantic segmentation method for adverse weather that incorporates a state-space model into a 3D sparse convolution-based architecture, sequentially modeling all features to learn domain-invariant representations. This method consists of two main components: domain feature decoupling and a Mamba-based encoder. The former performs feature disentanglement before sequential modeling, while the latter performs global modeling on voxelized point cloud data. In addition, we introduce a token-style augmentation to capture the intrinsic properties of input data. Extensive experimental results demonstrate that our method outperforms SOTA competitors in both DG and DA tasks, for instance, achieving +4.6% and +0.8% mIoU on "SynLiDAR to SemanticSTF".

AAAI Conference 2026 Conference Paper

Diffusion Once and Done: Degradation-Aware LoRA for All-in-One Image Restoration

  • Ni Tang
  • Xiaotong Luo
  • Zihan Cheng
  • Liangtai Zhou
  • Dongxiao Zhang
  • Yanyun Qu

Diffusion models have revealed powerful potential in all-in-one image restoration (AiOIR), as they excel at generating abundant texture details. Existing AiOIR methods either retrain a diffusion model or fine-tune a pretrained diffusion model with extra conditional guidance. However, they often suffer from high inference costs and limited adaptability to diverse degradation types. In this paper, we propose an efficient AiOIR method, Diffusion Once and Done (DOD), which aims to achieve superior restoration performance with only one-step sampling of Stable Diffusion (SD) models. Specifically, multi-degradation feature modulation is first introduced to capture different degradation prompts with a pretrained diffusion model. Then, parameter-efficient conditional low-rank adaptation integrates the prompts to enable fine-tuning of the SD model for adapting to different degradation types. Besides, a high-fidelity detail enhancement module is integrated into the decoder of SD to improve structural and textural details. Experiments demonstrate that our method outperforms existing diffusion-based restoration approaches in both visual quality and inference efficiency.
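
For readers curious how a degradation-conditioned low-rank adapter might look in practice, the sketch below shows a frozen linear layer with a LoRA branch whose contribution is gated by a degradation prompt vector. It is a minimal PyTorch illustration of the general idea; the class name, dimensions, and the sigmoid gate are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class DegradationLoRA(nn.Module):
    """Hypothetical sketch: a frozen base layer plus a low-rank branch whose
    contribution is scaled by a degradation prompt vector."""
    def __init__(self, in_dim, out_dim, rank=8, prompt_dim=64):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim)
        for p in self.base.parameters():
            p.requires_grad_(False)                      # pretrained weights stay frozen
        self.down = nn.Linear(in_dim, rank, bias=False)  # LoRA "A" matrix
        self.up = nn.Linear(rank, out_dim, bias=False)   # LoRA "B" matrix
        nn.init.zeros_(self.up.weight)                   # adapter starts as a no-op
        self.gate = nn.Linear(prompt_dim, 1)             # degradation-dependent scale

    def forward(self, x, degradation_prompt):
        scale = torch.sigmoid(self.gate(degradation_prompt))  # (B, 1)
        return self.base(x) + scale * self.up(self.down(x))

x = torch.randn(4, 256)
prompt = torch.randn(4, 64)   # e.g. a pooled feature describing the degradation type
layer = DegradationLoRA(256, 256)
print(layer(x, prompt).shape)  # torch.Size([4, 256])
```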

AAAI Conference 2026 Conference Paper

PC-CrossDiff: Point-Cluster Dual-Level Cross-Modal Differential Attention for Unified 3D Referring and Segmentation

  • Wenbin Tan
  • Jiawen Lin
  • Fangyong Wang
  • Yuan Xie
  • Yong Xie
  • Yachao Zhang
  • Yanyun Qu

3D Visual Grounding (3DVG) aims to localize the referent of natural language referring expressions through two core tasks: Referring Expression Comprehension (3DREC) and Segmentation (3DRES). While existing methods achieve high accuracy in simple, single-object scenes, they suffer from severe performance degradation in complex, multi-object scenes that are common in real-world settings, hindering practical deployment. Existing methods face two key challenges in complex, multi-object scenes: inadequate parsing of implicit localization cues critical for disambiguating visually similar objects, and ineffective suppression of dynamic spatial interference from co-occurring objects, resulting in degraded grounding accuracy. To address these challenges, we propose PC-CrossDiff, a unified dual-task framework with a dual-level cross-modal differential attention architecture for 3DREC and 3DRES. Specifically, the framework introduces: (i) Point-Level Differential Attention (PLDA) modules that apply bidirectional differential attention between text and point clouds, adaptively extracting implicit localization cues via learnable weights to improve discriminative representation; (ii) Cluster-Level Differential Attention (CLDA) modules that establish a hierarchical attention mechanism to adaptively enhance localization-relevant spatial relationships while suppressing ambiguous or irrelevant spatial relations through a localization-aware differential attention block. To address the scale disparity and conflicting gradients in joint 3DREC–3DRES training, we propose L_DGTL, a unified loss function that explicitly reduces multi-task crosstalk and enables effective parameter sharing across tasks. Our method achieves state-of-the-art performance on the ScanRefer, NR3D, and SR3D benchmarks. Notably, on the Implicit subsets of ScanRefer, it improves the [email protected] score by +10.16% for the 3DREC task, highlighting its strong ability to parse implicit spatial cues.

AAAI Conference 2026 Conference Paper

SpikingIR: A Novel Converted Spiking Neural Network for Efficient Image Restoration

  • Yang Ouyang
  • Zihan Cheng
  • Xiaotong Luo
  • Guoqi Li
  • Yanyun Qu

Image restoration has made great progress with the rise of deep learning, but its energy consumption limits its real-world applications. Spiking Neural Networks (SNNs) are seen as energy-efficient alternatives to Artificial Neural Networks (ANNs). Applying SNNs to image restoration (IR) remains challenging, primarily due to the limited information capacity of spike-based signals. This limitation leads to quantization errors and information loss, while IR tasks are highly sensitive to output precision and error. Thus, the restoration performance suffers significantly. To address this challenge, we propose SpikingIR, an ANN-to-SNN conversion framework for IR that reduces information loss and quantization error. SpikingIR mainly consists of two components: Convolutional Pixel Mapping (CPM) and Membrane Potential Reuse Neuron (MPRN), which are designed to alleviate quantization errors and information loss in the output and intermediate layers, respectively. Specifically, CPM maps discrete outputs into a continuous space, better aligning with pixel-level details. From the perspective of information entropy, we show that outputs of CPM contain more information than the original outputs. MPRN introduces a post-processing step with relaxed firing conditions to extract residual membrane potential, reducing information waste. Furthermore, we fine-tune the converted model to jointly optimize both accuracy and energy efficiency. Experimental results demonstrate that SpikingIR achieves performance comparable to ANN counterparts across various IR benchmarks while reducing energy consumption by up to 50%.
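
The membrane-potential reuse idea can be illustrated with a simple rate-coded integrate-and-fire simulation: after the usual firing steps, the residual membrane potential is checked once more against a relaxed threshold so it still contributes a spike. This is a hypothetical sketch of that mechanism, not the paper's MPRN; the thresholds and time-step count are illustrative.

```python
import torch

def if_neuron_with_reuse(current, timesteps=4, threshold=1.0, relaxed=0.5):
    """Hypothetical sketch of membrane-potential reuse: standard IF simulation
    followed by one extra pass that spikes on residual potential above a
    relaxed threshold, so less information is wasted."""
    v = torch.zeros_like(current)
    spikes = []
    for _ in range(timesteps):
        v = v + current                      # integrate input current
        fired = (v >= threshold).float()     # fire where threshold is crossed
        v = v - fired * threshold            # soft reset keeps the remainder
        spikes.append(fired)
    extra = (v >= relaxed).float()           # post-processing: reuse residual potential
    spikes.append(extra)
    return torch.stack(spikes).mean(0)       # rate-coded output

out = if_neuron_with_reuse(torch.rand(2, 8) * 0.6)
print(out.shape)  # torch.Size([2, 8])
```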

AAAI Conference 2026 Conference Paper

Target Refocusing via Attention Redistribution for Open-Vocabulary Semantic Segmentation: An Explainability Perspective

  • Jiahao Li
  • Yang Lu
  • Yachao Zhang
  • Yong Xie
  • Fangyong Wang
  • Yuan Xie
  • Yanyun Qu

Open-vocabulary semantic segmentation (OVSS) employs pixel-level vision-language alignment to associate category-related prompts with corresponding pixels. A key challenge is enhancing the multimodal dense prediction capability, specifically this pixel-level multimodal alignment. Although existing methods achieve promising results by leveraging CLIP’s vision-language alignment, they rarely investigate the performance boundaries of CLIP for dense prediction from an interpretability mechanisms perspective. In this work, we systematically investigate CLIP's internal mechanisms and identify a critical phenomenon: analogous to human distraction, CLIP diverts significant attention resources from target regions to irrelevant tokens. Our analysis reveals that these tokens arise from dimension-specific over-activation; filtering them enhances CLIP's dense prediction performance. Consequently, we propose Refocusing CLIP (RF-CLIP), a training-free approach that emulates human distraction-refocusing behavior to redirect attention from distraction tokens back to target regions, thereby refining CLIP's multimodal alignment granularity. Our method achieves SOTA performance on eight benchmarks while maintaining high inference efficiency.
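
A rough picture of attention redistribution: identify over-activated "distraction" tokens, zero the attention columns pointing at them, and renormalize so attention mass returns to the remaining tokens. The snippet is a hypothetical sketch under that reading; the over-activation test (peak activation versus a multiple of the mean peak) is an assumption standing in for the paper's dimension-specific criterion.

```python
import torch

def refocus_attention(attn, features, tau=5.0):
    """Hypothetical sketch: flag tokens whose peak activation in any feature
    dimension is tau times the average peak, zero out attention paid to them,
    and renormalize the remaining attention. attn: (B, N, N), features: (B, N, D)."""
    peak = features.abs().amax(dim=-1)                  # (B, N) peak activation per token
    over = peak > tau * peak.mean(dim=1, keepdim=True)  # (B, N) over-activated tokens
    attn = attn.masked_fill(over.unsqueeze(1), 0.0)     # drop attention columns to them
    return attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-6)

attn = torch.rand(1, 5, 5).softmax(-1)
feats = torch.randn(1, 5, 16)
print(refocus_attention(attn, feats).sum(-1))  # rows re-sum to ~1
```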

AAAI Conference 2026 Conference Paper

xMHashSeg: Cross-modal Hash Learning for Training-free Unsupervised LiDAR Semantic Segmentation

  • Jialong Zhang
  • Yachao Zhang
  • Yao Wu
  • Jiangming Shi
  • Fangyong Wang
  • Yanyun Qu

3D semantic segmentation serves as a fundamental component in many applications, such as autonomous driving and medical image analysis. Although recent methods have advanced the field, adapting these methods to new environments or object categories without extensive retraining remains a significant challenge. To address this, we introduce xMHashSeg, a novel training-free cross-modal LiDAR semantic segmentation framework. xMHashSeg leverages foundation models and a non-parametric network to extract features from 2D images and 3D point clouds, subsequently integrating these features through hash learning. Specifically, we develop point-SANN, a novel self-adaption non-parametric network that can extract robust 3D features from raw point clouds, while 2D features are directly extracted through the foundation model DINOv2. To reconcile inconsistencies across different modalities, we introduce a Hash Code Learning Module that projects all information into a common hash space, learning a consistent hash code that enhances feature integration. Additionally, depth maps are utilized as an intermediary form between 2D and 3D data to facilitate convergence during hash code learning. Our experimental results on various multi-modality datasets demonstrate that xMHashSeg outperforms zero-shot learning approaches and achieves performance close to that of unsupervised domain adaptation and test-time adaptation methods, without requiring any annotations or additional training.
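
Hash-based cross-modal alignment of the kind described here typically projects each modality into a shared K-bit space, relaxes the binary constraint with tanh during optimization, and binarizes with sign at the end. The module below is a minimal sketch under those standard assumptions; the names, dimensions, and loss weights are illustrative rather than taken from xMHashSeg.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HashCodeLearner(nn.Module):
    """Hypothetical sketch: map 2D and 3D features into a shared K-bit hash
    space and pull matched pairs toward the same binary code."""
    def __init__(self, dim2d=384, dim3d=64, bits=128):
        super().__init__()
        self.proj2d = nn.Linear(dim2d, bits)
        self.proj3d = nn.Linear(dim3d, bits)

    def forward(self, f2d, f3d):
        h2d = torch.tanh(self.proj2d(f2d))       # relaxed binary codes in (-1, 1)
        h3d = torch.tanh(self.proj3d(f3d))
        consistency = F.mse_loss(h2d, h3d)       # matched pixels/points should agree
        quantization = ((h2d.abs() - 1) ** 2).mean() + ((h3d.abs() - 1) ** 2).mean()
        codes = torch.sign(h3d.detach())         # final binary codes
        return codes, consistency + 0.1 * quantization

model = HashCodeLearner()
codes, loss = model(torch.randn(100, 384), torch.randn(100, 64))
print(codes.shape, loss.item())
```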

ICML Conference 2025 Conference Paper

Large Continual Instruction Assistant

  • Jingyang Qiao
  • Zhizhong Zhang 0001
  • Xin Tan 0002
  • Yanyun Qu
  • Shouhong Ding
  • Yuan Xie 0006

Continual Instruction Tuning (CIT) is adopted to continually instruct Large Models to follow human intent, dataset by dataset. It is observed that existing gradient updates heavily degrade performance on previous datasets during the CIT process. In contrast, Exponential Moving Average (EMA) can trace previous parameters, which helps to decrease forgetting. Nonetheless, its fixed balance weight fails to deal with ever-changing datasets, leading to an imbalance between plasticity and stability. In this paper, we propose a general continual instruction tuning framework to address this challenge. Starting from the trade-off prerequisite and the EMA update, we propose the plasticity and stability ideal condition. Based on a Taylor expansion of the loss function, we find that the optimal balance weight can be automatically determined by the gradients and learned parameters. Therefore, we propose a stable-plasticity balanced coefficient to avoid knowledge interference. Based on the semantic similarity of the instructions, we can determine whether to retrain or expand the training parameters and allocate the most suitable parameters for the testing instances. Extensive experiments across multiple continual instruction tuning benchmarks demonstrate that our approach not only enhances anti-forgetting capabilities but also significantly improves overall continual tuning performance. Our code is available at https://github.com/JingyangQiao/CoIN.
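
The EMA backbone of this framework is the standard parameter-space update theta_ema <- beta * theta_ema + (1 - beta) * theta; the paper's contribution is choosing beta per step instead of fixing it. The sketch below shows that update plus an illustrative stand-in for the adaptive coefficient based on a gradient-to-weight norm ratio; the actual closed form in the paper is derived from a Taylor expansion and is not reproduced here.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def ema_update(ema_model, live_model, beta):
    """theta_ema <- beta * theta_ema + (1 - beta) * theta_live."""
    for p_ema, p in zip(ema_model.parameters(), live_model.parameters()):
        p_ema.mul_(beta).add_(p, alpha=1.0 - beta)

def adaptive_beta(live_model, base=0.99):
    """Illustrative stand-in for the stable-plasticity coefficient: trust the
    new task more (smaller beta) when gradients are large relative to weights."""
    gnorm = sum(float(p.grad.norm()) for p in live_model.parameters() if p.grad is not None)
    wnorm = sum(float(p.norm()) for p in live_model.parameters()) + 1e-8
    return base * (1.0 - 0.5 * min(gnorm / wnorm, 1.0))

live = nn.Linear(8, 8)
ema = nn.Linear(8, 8)
ema.load_state_dict(live.state_dict())
live(torch.randn(2, 8)).sum().backward()         # populate gradients for one step
ema_update(ema, live, beta=adaptive_beta(live))  # one balanced EMA step
```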

AAAI Conference 2025 Conference Paper

MaskViM: Domain Generalized Semantic Segmentation with State Space Models

  • Jiahao Li
  • Yang Lu
  • Yuan Xie
  • Yanyun Qu

Domain Generalized Semantic Segmentation (DGSS) aims to utilize a segmentation model trained on known source domains to make predictions on unknown target domains. Currently, there are two network architectures: one based on Convolutional Neural Networks (CNNs) and the other based on Vision Transformers (ViTs). However, both CNN-based and ViT-based DGSS methods face challenges: the former lacks a global receptive field, while the latter has higher computational demands. Drawing inspiration from State Space Models (SSMs), which not only possess a global receptive field but also maintain linear complexity, we propose an SSM-based method for achieving DGSS. In this work, we first elucidate why masking makes sense in SSM-based DGSS and propose a mask learning mechanism. Leveraging this mechanism, we present our Mask Vision Mamba network (MaskViM), a model for SSM-based DGSS, and design a mask loss to optimize MaskViM. Our method achieves superior performance on four diverse DGSS settings, which demonstrates its effectiveness.

AAAI Conference 2025 Conference Paper

Omni-Query Active Learning for Source-Free Domain Adaptive Cross-Modality 3D Semantic Segmentation

  • Jianxiang Xie
  • Yao Wu
  • Yachao Zhang
  • Zhongchao Shi
  • Jianping Fan
  • Yuan Xie
  • Yanyun Qu

Source-Free Domain Adaptation (SFDA) aims to transfer a pre-trained source model to the unlabeled target domain without accessing the source data, thereby effectively solving labeled-data dependency and domain shift problems. However, the SFDA setting faces a bottleneck due to the absence of supervisory information. To mitigate this problem, Active Learning (AL) is combined with SFDA, endeavoring to actively label a small set of the highest-quality target points so that models with satisfactory performance can be obtained at an acceptable cost. Nevertheless, several issues remain unresolved, namely when to query new labels during training, what kind of samples deserve labeling to ensure rich information, and where the labels should be distributed to guarantee diversity. Thus, we propose OmniQuery to comprehensively address the "When, What, and Where" problems of active point querying in source-free domain adaptation for cross-modal 3D semantic segmentation. The method consists of three main components: Query Decider, Point Ranker, and Budget Slicer. The Query Decider determines the optimal timing to query new points by fitting the validation curves during training. The Point Ranker nominates points for annotation by calculating the ambiguity of neighboring points in the feature space. The Budget Slicer allocates the annotation quota, i.e., the labeling percentage of the point cloud, to different semantic regions by utilizing the advanced 2D semantic segmentation capabilities of the Segment Anything Model (SAM). Extensive experiments demonstrate the effectiveness of our proposed method, achieving up to 99.64% of fully supervised performance with only 3% of labels, and consistently outperforming comparison methods across various scenarios.

AAAI Conference 2024 Conference Paper

AdaFormer: Efficient Transformer with Adaptive Token Sparsification for Image Super-resolution

  • Xiaotong Luo
  • Zekun Ai
  • Qiuyuan Liang
  • Ding Liu
  • Yuan Xie
  • Yanyun Qu
  • Yun Fu

Efficient transformer-based models have made remarkable progress in image super-resolution (SR). Most of these works mainly design elaborate structures to accelerate the inference of the transformer, where all feature tokens are propagated equally. However, they ignore an underlying characteristic of image content, i.e., various image regions have distinct restoration difficulties, especially for large images (2K-8K), and thus fail to achieve adaptive inference. In this work, we propose an adaptive token sparsification transformer (AdaFormer) to speed up model inference for image SR. Specifically, a texture-relevant sparse attention block with parallel global and local branches is introduced, aiming to integrate informative tokens from the global view instead of only within fixed local windows. Then, an early-exit strategy is designed to progressively halt tokens according to token importance. To estimate the plausibility of each token, we adopt a lightweight confidence estimator, which is constrained by an uncertainty-guided loss to obtain a binary halting mask over the tokens. Experiments on large images show that our method reduces latency by nearly 90% compared with SwinIR on Test8K, while maintaining comparable performance.
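
The early-exit idea can be sketched as a per-token confidence head whose output decides which tokens skip the next block. Below is a hypothetical PyTorch illustration; the hard threshold and carrying halted tokens forward unchanged are simplifying assumptions (the paper learns a binary halting mask with an uncertainty-guided loss).

```python
import torch
import torch.nn as nn

class TokenHalting(nn.Module):
    """Hypothetical sketch: a lightweight confidence head decides which tokens
    are 'easy' and can exit early; only the remaining tokens pass through the
    next block."""
    def __init__(self, dim, block):
        super().__init__()
        self.confidence = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())
        self.block = block

    def forward(self, tokens, thresh=0.5):
        conf = self.confidence(tokens).squeeze(-1)   # (B, N) per-token confidence
        keep = conf < thresh                         # low confidence -> still needs refinement
        out = tokens.clone()
        if keep.any():
            out[keep] = self.block(tokens[keep])     # process only the hard tokens
        return out, keep

block = nn.Linear(64, 64)  # stand-in for a transformer block
layer = TokenHalting(64, block)
y, keep = layer(torch.randn(2, 16, 64))
print(y.shape, keep.float().mean().item())  # fraction of tokens still processed
```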

AAAI Conference 2024 Conference Paper

Beyond the Label Itself: Latent Labels Enhance Semi-supervised Point Cloud Panoptic Segmentation

  • Yujun Chen
  • Xin Tan
  • Zhizhong Zhang
  • Yanyun Qu
  • Yuan Xie

Given the exorbitant expense of labeling autopilot datasets and the growing trend of utilizing unlabeled data, semi-supervised segmentation on point clouds becomes increasingly imperative. Intuitively, finding more "unspoken words" (i.e., latent instance information) beyond the label itself should help to improve performance. In this paper, we discover two types of latent labels behind the displayed label embedded in LiDAR and image data. First, in the LiDAR Branch, we propose a novel augmentation, Cylinder-Mix, which is able to generate additional yet reliable samples for training. Second, in the Image Branch, we propose the Instance Position-scale Learning (IPSL) Module to learn and fuse the information of instance position and scale, which comes from a 2D pre-trained detector and is a type of latent label obtained from 3D-to-2D projection. Finally, the two latent labels are embedded into the multi-modal panoptic segmentation network. The ablation of the IPSL module demonstrates its robust adaptability, and the experiments evaluated on SemanticKITTI and nuScenes demonstrate that our model outperforms the state-of-the-art method, LaserMix.

IJCAI Conference 2024 Conference Paper

CLIP-FSAC: Boosting CLIP for Few-Shot Anomaly Classification with Synthetic Anomalies

  • Zuo Zuo
  • Yao Wu
  • Baoqiang Li
  • Jiahao Dong
  • You Zhou
  • Lei Zhou
  • Yanyun Qu
  • Zongze Wu

Few-shot anomaly classification (FSAC) is a vital task in the manufacturing industry. Recent methods focus on utilizing CLIP for zero-/few-normal-shot anomaly detection instead of custom models. However, there is a lack of specific text prompts for anomaly classification, and most of these methods ignore the modality gap between image and text. Meanwhile, there is a distribution discrepancy between the pre-training data and the target data. To provide a remedy, in this paper we propose a method to boost CLIP for few-normal-shot anomaly classification, dubbed CLIP-FSAC, which contains two stages of training with alternating fine-tuning of two modality-specific adapters. Specifically, in the first stage, we train the image adapter with the text representation output from the text encoder and introduce image-to-text tuning to enhance multi-modal interaction and facilitate a better language-compatible visual representation. In the second stage, we freeze the image adapter to train the text adapter. Both of them are constrained by a fusion-text contrastive loss. Comprehensive experimental results are provided for evaluating our method on few-normal-shot anomaly classification, which outperforms the state-of-the-art method by 12.2%, 10.9%, and 10.4% AUROC on VisA for the 1-, 2-, and 4-shot settings.

AAAI Conference 2024 Conference Paper

CLIP-Guided Federated Learning on Heterogeneity and Long-Tailed Data

  • Jiangming Shi
  • Shanshan Zheng
  • Xiangbo Yin
  • Yang Lu
  • Yuan Xie
  • Yanyun Qu

Federated learning (FL) provides a decentralized machine learning paradigm where a server collaborates with a group of clients to learn a global model without accessing the clients' data. User heterogeneity is a significant challenge for FL, and together with class-distribution imbalance it further increases the difficulty of FL. Great progress has been made in large vision-language models, such as Contrastive Language-Image Pre-training (CLIP), which paves a new way for image classification and object recognition. Inspired by the success of CLIP on few-shot and zero-shot learning, we use CLIP to optimize federated learning between server and client models under its vision-language supervision. This is promising for mitigating user heterogeneity and class-distribution imbalance thanks to the powerful cross-modality representation and rich open-vocabulary prior knowledge. In this paper, we propose the CLIP-guided FL (CLIP2FL) method for heterogeneous and long-tailed data. In CLIP2FL, the knowledge of the off-the-shelf CLIP model is transferred to the client-server models, and a bridge is built between the client and server. Specifically, for client-side learning, knowledge distillation is conducted between client models and CLIP to improve the ability of client-side feature representation. For server-side learning, in order to mitigate the heterogeneity and class-distribution imbalance, we generate federated features to retrain the server model. Prototype contrastive learning with the supervision of the text encoder of CLIP is introduced to generate federated features depending on the client-side gradients, and they are used to retrain a balanced server classifier. Extensive experimental results on several benchmarks demonstrate that CLIP2FL achieves impressive performance and effectively deals with data heterogeneity and long-tail distribution. The code is available at https://github.com/shijiangming1/CLIP2FL.

AAAI Conference 2024 Conference Paper

Cross-Modal Match for Language Conditioned 3D Object Grounding

  • Yachao Zhang
  • Runze Hu
  • Ronghui Li
  • Yanyun Qu
  • Yuan Xie
  • Xiu Li

Language-conditioned 3D object grounding aims to find the object within a 3D scene mentioned by natural language descriptions, which mainly depends on matching between vision and natural language. Considerable improvement in grounding performance has been achieved by improving the multimodal fusion mechanism or bridging the gap between detection and matching. However, several mismatches are ignored, i.e., the mismatch between local visual representations and global sentence representations, and the mismatch between the visual space and the corresponding label-word space. In this paper, we propose cross-modal match for 3D grounding from the perspective of mitigating these mismatches. Specifically, to match local visual features with the global description sentence, we propose a BEV (bird's-eye-view) based global information embedding module. It projects multiple object proposal features into the BEV, and the relations of different objects are captured by a visual transformer which can model both positions and features with long-range dependencies. To circumvent the mismatch in feature spaces of different modalities, we propose cross-modal consistency learning. It imposes cross-modal consistency constraints to convert the visual feature space into the label-word feature space, resulting in easier matching. Besides, we introduce a label distillation loss and a global distillation loss to drive these matches to be learned in a distillation manner. We evaluate our method in mainstream evaluation settings on three datasets, and the results demonstrate the effectiveness of the proposed method.

AAAI Conference 2024 Conference Paper

Efficient Lightweight Image Denoising with Triple Attention Transformer

  • Yubo Zhou
  • Jin Lin
  • Fangchen Ye
  • Yanyun Qu
  • Yuan Xie

Transformers have shown outstanding performance on image denoising, but existing Transformer methods for image denoising have large model sizes and high computational complexity, which is unfriendly to resource-constrained devices. In this paper, we propose a Lightweight Image Denoising Transformer method (LIDFormer) based on Triple Multi-Dconv Head Transposed Attention (TMDTA) to boost computational efficiency. LIDFormer first applies a Discrete Wavelet Transform (DWT), which maps the input image into a low-frequency space, greatly reducing the computational complexity of image denoising. However, the low-frequency image lacks fine-feature information, which degrades denoising performance. To handle this problem, we introduce a Complementary Periodic Feature Reusing (CPFR) scheme for aggregating shallow-layer and deep-layer features. Furthermore, TMDTA is proposed to integrate global context along three dimensions, thereby enhancing global feature representation. Note that our method can be applied as a pipeline to both convolutional neural networks and Transformers. Extensive experiments on several benchmarks demonstrate that the proposed LIDFormer achieves a better trade-off between high performance and low computational complexity on real-world image denoising tasks.

NeurIPS Conference 2024 Conference Paper

Learning Commonality, Divergence and Variety for Unsupervised Visible-Infrared Person Re-identification

  • Jiangming Shi
  • Xiangbo Yin
  • Yachao Zhang
  • Zhizhong Zhang
  • Yuan Xie
  • Yanyun Qu

Unsupervised visible-infrared person re-identification (USVI-ReID) aims to match specified persons in infrared images to visible images without annotations, and vice versa. USVI-ReID is a challenging yet underexplored task. Most existing methods address the USVI-ReID through cluster-based contrastive learning, which simply employs the cluster center to represent an individual. However, the cluster center primarily focuses on commonality, overlooking divergence and variety. To address the problem, we propose a Progressive Contrastive Learning with Hard and Dynamic Prototypes for USVI-ReID. In brief, we generate the hard prototype by selecting the sample with the maximum distance from the cluster center. We reveal that the inclusion of the hard prototype in contrastive loss helps to emphasize divergence. Additionally, instead of rigidly aligning query images to a specific prototype, we generate the dynamic prototype by randomly picking samples within a cluster. The dynamic prototype is used to encourage variety. Finally, we introduce a progressive learning strategy to gradually shift the model's attention towards divergence and variety, avoiding cluster deterioration. Extensive experiments conducted on the publicly available SYSU-MM01 and RegDB datasets validate the effectiveness of the proposed method.
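
The three prototype types described above are easy to picture in code: the cluster centroid (commonality), the member farthest from the centroid (the hard prototype, emphasizing divergence), and a randomly drawn member (the dynamic prototype, encouraging variety). The function below is a small illustrative sketch of that selection step only, not the full progressive contrastive training loop.

```python
import torch

def build_prototypes(features, labels):
    """Hypothetical sketch: per cluster, the centroid gives commonality, the
    farthest member gives the hard prototype, and a random member gives the
    dynamic prototype."""
    protos = {}
    for c in labels.unique():
        members = features[labels == c]
        center = members.mean(0)
        dists = (members - center).norm(dim=1)
        hard = members[dists.argmax()]                                # emphasizes divergence
        dynamic = members[torch.randint(len(members), (1,)).item()]  # encourages variety
        protos[int(c)] = (center, hard, dynamic)
    return protos

feats = torch.randn(50, 8)
labels = torch.randint(0, 5, (50,))
print(len(build_prototypes(feats, labels)))  # one prototype triple per cluster
```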

AAAI Conference 2024 Conference Paper

Learning Task-Aware Language-Image Representation for Class-Incremental Object Detection

  • Hongquan Zhang
  • Bin-Bin Gao
  • Yi Zeng
  • Xudong Tian
  • Xin Tan
  • Zhizhong Zhang
  • Yanyun Qu
  • Jun Liu

Class-incremental object detection (CIOD) is a capability desired in real-world settings, requiring an object detector to continuously adapt to new tasks without forgetting learned ones; the main challenge is catastrophic forgetting. Many methods based on distillation and replay have been proposed to alleviate this problem. However, they typically learn on a pure visual backbone, neglecting the powerful representation capabilities of textual cues, which to some extent limits their performance. In this paper, we propose task-aware language-image representation to mitigate catastrophic forgetting, introducing a new paradigm for language-image-based CIOD. First of all, we demonstrate the significant advantage of language-image detectors in mitigating catastrophic forgetting. Secondly, we propose a task-aware language-image representation learning method that overcomes the drawback of directly utilizing a language-image detector for CIOD. More specifically, we learn the language-image representation of different tasks through an insulating approach in the training stage, while using the alignment scores produced by task-specific language-image representations in the inference stage. Through our proposed method, language-image detectors can be more practical for CIOD. We conduct extensive experiments on COCO 2017 and Pascal VOC 2007 and demonstrate that the proposed method achieves state-of-the-art results under various CIOD settings.

ICLR Conference 2024 Conference Paper

Prompt Gradient Projection for Continual Learning

  • Jingyang Qiao
  • Zhizhong Zhang 0001
  • Xin Tan 0002
  • Chengwei Chen
  • Yanyun Qu
  • Yong Peng 0002
  • Yuan Xie 0006

Prompt-tuning has demonstrated impressive performance in continual learning by querying relevant prompts for each input instance, which can avoid the introduction of a task identifier. Its forgetting is therefore reduced, as this instance-wise query mechanism enables us to select and update only relevant prompts. In this paper, we further integrate prompt-tuning with the gradient projection approach. Our observation is: prompt-tuning releases the necessity of a task identifier for the gradient projection method, while gradient projection provides theoretical guarantees against forgetting for prompt-tuning. This inspires a new prompt gradient projection approach (PGP) for continual learning. In PGP, we deduce that reaching the orthogonal condition for the prompt gradient can effectively prevent forgetting via the self-attention mechanism in the vision transformer. The condition equations are then realized by conducting Singular Value Decomposition (SVD) on an element-wise sum space between the input space and the prompt space. We validate our method on diverse datasets, and experiments demonstrate its effectiveness in reducing forgetting in class-incremental, online class-incremental, and task-incremental settings. The code is available at https://github.com/JingyangQiao/prompt-gradient-projection.
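
The projection step itself is standard gradient-projection machinery: take an SVD of the stored feature matrix, keep the leading singular directions as the protected subspace, and subtract the gradient's component inside it. The sketch below illustrates that step under the assumption that the feature matrix already encodes the element-wise sum of input and prompt spaces; the energy threshold is an illustrative hyperparameter.

```python
import torch

def project_prompt_grad(grad, feature_matrix, energy=0.95):
    """Hypothetical sketch of the projection step: the top singular directions
    of the stored feature matrix form the 'old knowledge' subspace, and the
    gradient component lying inside it is removed.
    grad: (d,), feature_matrix: (d, n) with stored features as columns."""
    U, S, _ = torch.linalg.svd(feature_matrix, full_matrices=False)
    cum = torch.cumsum(S ** 2, 0) / (S ** 2).sum()
    k = int((cum < energy).sum()) + 1        # rank keeping `energy` of the spectrum
    basis = U[:, :k]                         # (d, k) protected subspace
    return grad - basis @ (basis.T @ grad)   # keep only the orthogonal component

g = torch.randn(128)
M = torch.randn(128, 40)
print(project_prompt_grad(g, M).shape)  # torch.Size([128])
```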

NeurIPS Conference 2024 Conference Paper

Relationship Prompt Learning is Enough for Open-Vocabulary Semantic Segmentation

  • Jiahao Li
  • Yang Lu
  • Yuan Xie
  • Yanyun Qu

Open-vocabulary semantic segmentation (OVSS) aims to segment unseen classes without corresponding labels. Existing Vision-Language Model (VLM)-based methods leverage the VLM's rich knowledge to enhance additional explicit segmentation-specific networks, yielding competitive results, but at the cost of extensive training. To reduce the cost, we attempt to enable the VLM to directly produce segmentation results without any segmentation-specific networks. Prompt learning offers a direct and parameter-efficient approach, yet it falls short in guiding the VLM for pixel-level visual classification. Therefore, we propose the Relationship Prompt Module (RPM), which generates a relationship prompt that directs the VLM to extract pixel-level semantic embeddings suitable for OVSS. Moreover, RPM integrates with the VLM to construct the Relationship Prompt Network (RPN), achieving OVSS without any segmentation-specific networks. RPN attains state-of-the-art performance with merely about 3M trainable parameters (2% of total parameters).

AAAI Conference 2024 Conference Paper

SkipDiff: Adaptive Skip Diffusion Model for High-Fidelity Perceptual Image Super-resolution

  • Xiaotong Luo
  • Yuan Xie
  • Yanyun Qu
  • Yun Fu

It is well known that image quality assessment usually meets the problem of the perception-distortion (p-d) trade-off. Existing deep image super-resolution (SR) methods either focus on high fidelity with pixel-level objectives or high perception with generative models. The emergence of diffusion models paves a fresh way for image restoration, which has the potential to offer a brand-new solution for the p-d trade-off. We experimentally observed that perceptual quality and distortion change in opposite directions as the number of sampling steps increases. In light of this property, we propose an adaptive skip diffusion model (SkipDiff), which aims to achieve high-fidelity perceptual image SR with fewer sampling steps. Specifically, it decouples the sampling procedure into coarse skip approximation and fine skip refinement stages. A coarse-grained skip diffusion is first performed as a high-fidelity prior to obtain a latent approximation of the full diffusion. Then, a fine-grained skip diffusion follows to further refine the latent sample to promote perception, where the fine time steps are adaptively learned by deep reinforcement learning. Meanwhile, this approach also enables faster sampling of the diffusion model by skipping the intermediate denoising process to shorten the effective computation steps. Extensive experimental results show that our SkipDiff achieves superior perceptual quality with plausible reconstruction accuracy and a faster sampling speed.

NeurIPS Conference 2024 Conference Paper

UniDSeg: Unified Cross-Domain 3D Semantic Segmentation via Visual Foundation Models Prior

  • Yao Wu
  • Mingwei Xing
  • Yachao Zhang
  • Xiaotong Luo
  • Yuan Xie
  • Yanyun Qu

3D semantic segmentation using a model adapted from a source domain, with or without access to unlabeled target-domain data, is a fundamental task in computer vision, covering domain adaptation and domain generalization. The essence of simultaneously solving cross-domain tasks is to enhance the generalizability of the encoder. In light of this, we propose a groundbreaking universal method with the help of off-the-shelf Visual Foundation Models (VFMs) to boost the adaptability and generalizability of cross-domain 3D semantic segmentation, dubbed UniDSeg. Our method explores the VFM prior and how to harness it, aiming to inherit the recognition ability of VFMs. Specifically, this method introduces layer-wise learnable blocks into the VFMs, which hinge on alternately learning two representations during training: (i) learning a visual prompt, where the 3D-to-2D transitional prior and task-shared knowledge are captured from the prompt space; and (ii) learning a deep query, where spatial tunability is built into the representation of distinct instances, driven by prompts in the query space. Integrating these representations into a cross-modal learning framework, UniDSeg efficiently mitigates the domain gap between 2D and 3D modalities, achieving unified cross-domain 3D semantic segmentation. Extensive experiments demonstrate the effectiveness of our method across widely recognized tasks and datasets, all achieving superior performance over state-of-the-art methods. Remarkably, UniDSeg achieves 57.5%/54.4% mIoU on "A2D2/sKITTI" for the domain adaptive/generalized tasks. Code is available at https://github.com/Barcaaaa/UniDSeg.

NeurIPS Conference 2023 Conference Paper

Learning Re-sampling Methods with Parameter Attribution for Image Super-resolution

  • Xiaotong Luo
  • Yuan Xie
  • Yanyun Qu

Single image super-resolution (SISR) has made significant breakthroughs, benefiting from the prevalent rise of deep neural networks and large-scale training samples. Mainstream deep SR models primarily focus on network architecture design as well as optimization schemes, while few pay attention to the training data. In fact, most existing SR methods train the model on uniformly sampled patch pairs from the whole image. However, the uneven image content gives the training data an unbalanced distribution, i.e., easily reconstructed regions (smooth) occupy the majority of the data, while hard regions (edge or texture) have only a few samples. Based on this phenomenon, we rethink the current paradigm of merely using uniform data sampling for training SR models. In this paper, we propose a simple yet effective Bi-Sampling Parameter Attribution (BSPA) method for accurate image SR. Specifically, the bi-sampling consists of uniform sampling and inverse sampling, which is introduced to reconcile the unbalanced inherent data bias. The former aims to keep the intrinsic data distribution, and the latter is designed to enhance the feature extraction ability of the model on hard samples. Moreover, integrated gradients are introduced to attribute the contribution of each parameter in the alternate models trained on both sampled data streams, so as to filter out trivial parameters for further dynamic refinement. By progressively decoupling the allocation of parameters, the SR model can learn a more compact representation. Extensive experiments on publicly available datasets demonstrate that our proposal can effectively boost the performance of baseline methods from the data re-sampling view.
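
Bi-sampling, as described, mixes two draws per batch: a uniform draw that preserves the natural patch distribution and a difficulty-weighted draw that oversamples edge and texture patches. The NumPy sketch below shows that mixing; using a per-patch hardness score as the weight and the 50/50 split are assumptions for illustration.

```python
import numpy as np

def bi_sample(patch_difficulty, n, seed=0):
    """Hypothetical sketch of bi-sampling: half the batch is drawn uniformly
    (preserving the natural patch distribution) and half with probability
    proportional to difficulty, so hard (edge/texture) patches appear more often."""
    rng = np.random.default_rng(seed)
    idx_uniform = rng.choice(len(patch_difficulty), n // 2, replace=True)
    p = patch_difficulty / patch_difficulty.sum()
    idx_inverse = rng.choice(len(patch_difficulty), n - n // 2, replace=True, p=p)
    return np.concatenate([idx_uniform, idx_inverse])

difficulty = np.abs(np.random.default_rng(1).normal(size=1000)) + 1e-3  # stand-in hardness scores
print(bi_sample(difficulty, 16))  # indices of one training batch
```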

IJCAI Conference 2023 Conference Paper

VS-Boost: Boosting Visual-Semantic Association for Generalized Zero-Shot Learning

  • Xiaofan Li
  • Yachao Zhang
  • Shiran Bian
  • Yanyun Qu
  • Yuan Xie
  • Zhongchao Shi
  • Jianping Fan

Unlike conventional zero-shot learning (CZSL), which only focuses on the recognition of unseen classes by using a classifier trained on seen classes and semantic embeddings, generalized zero-shot learning (GZSL) aims at recognizing both seen and unseen classes, so it is more challenging due to the extreme training imbalance. Recently, some feature generation methods have introduced metric learning to enhance the discriminability of visual features. Although these methods achieve good results, they focus only on metric learning in the visual feature space to enhance features and ignore the association between the feature space and the semantic space. Since GZSL methods use semantics as prior knowledge to migrate visual knowledge to unseen classes, consistency between the visual space and the semantic space is critical. To this end, we propose relational metric learning, which can relate the metrics in the two spaces and make their distributions more consistent. Based on the generation method and relational metric learning, we propose a novel GZSL method, termed VS-Boost, which can effectively boost the association between vision and semantics. The experimental results demonstrate that our method is effective and achieves significant gains on five benchmark datasets compared with state-of-the-art methods.

AAAI Conference 2023 Conference Paper

Weakly Supervised 3D Segmentation via Receptive-Driven Pseudo Label Consistency and Structural Consistency

  • Yuxiang Lan
  • Yachao Zhang
  • Yanyun Qu
  • Cong Wang
  • Chengyang Li
  • Jia Cai
  • Yuan Xie
  • Zongze Wu

As manual point-wise labeling is time- and labor-intensive for fully supervised large-scale point cloud semantic segmentation, weakly supervised methods are increasingly active. However, existing methods fail to generate high-quality pseudo labels effectively, leading to unsatisfactory results. In this paper, we propose a weakly supervised point cloud semantic segmentation framework with receptive-driven pseudo label consistency and structural consistency to mine potential knowledge. Specifically, we propose three consistency constraints: pseudo label consistency among different scales, semantic structure consistency between intra-class features, and class-level relation structure consistency between pair-wise categories. The three consistency constraints are jointly used to effectively prepare and utilize pseudo labels simultaneously for stable training. Finally, extensive experimental results on three challenging datasets demonstrate that our method significantly outperforms state-of-the-art weakly supervised methods and even achieves comparable performance to fully supervised methods.

AAAI Conference 2022 Conference Paper

Comprehensive Regularization in a Bi-directional Predictive Network for Video Anomaly Detection

  • Chengwei Chen
  • Yuan Xie
  • Shaohui Lin
  • Angela Yao
  • Guannan Jiang
  • Wei Zhang
  • Yanyun Qu
  • Ruizhi Qiao

Video anomaly detection aims to automatically identify unusual objects or behaviours by learning from normal videos. Previous methods tend to use simplistic reconstruction or prediction constraints, which leads to the insufficiency of learned representations for normal data. As such, we propose a novel bi-directional architecture with three consistency constraints to comprehensively regularize the prediction task from pixelwise, cross-modal, and temporal-sequence levels. First, predictive consistency is proposed to consider the symmetry property of motion and appearance in forwards and backwards time, which ensures the highly realistic appearance and motion predictions at the pixel-wise level. Second, association consistency considers the relevance between different modalities and uses one modality to regularize the prediction of another one. Finally, temporal consistency utilizes the relationship of the video sequence and ensures that the predictive network generates temporally consistent frames. During inference, the pattern of abnormal frames is unpredictable and will therefore cause higher prediction errors. Experiments show that our method outperforms advanced anomaly detectors and achieves state-of-the-art results on UCSD Ped2, CUHK Avenue, and ShanghaiTech datasets.

AAAI Conference 2022 Conference Paper

Uncertainty-Driven Dehazing Network

  • Ming Hong
  • Jianzhuang Liu
  • Cuihua Li
  • Yanyun Qu

Deep learning has made remarkable achievements in single image haze removal. However, existing deep dehazing models only give deterministic results without discussing their uncertainty. There are two types of uncertainty in dehazing models: aleatoric uncertainty, which comes from noise inherent in the observations, and epistemic uncertainty, which accounts for uncertainty in the model. In this paper, we propose a novel uncertainty-driven dehazing network (UDN) that improves the dehazing results by exploiting the relationship between the uncertain and confident representations. We first introduce an Uncertainty Estimation Block (UEB) to predict the aleatoric and epistemic uncertainty together. Then, we propose an Uncertainty-aware Feature Modulation (UFM) block to adaptively enhance the learned features. UFM predicts a convolution kernel and channel-wise modulation coefficients conditioned on the uncertainty-weighted representation. Moreover, we develop an uncertainty-driven self-distillation loss to improve the uncertain representation by transferring knowledge from the confident one. Extensive experimental results on synthetic datasets and real-world images show that UDN achieves significant quantitative and qualitative improvements, outperforming state-of-the-art methods.
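
A common way to realize the two uncertainties mentioned here is to predict a per-pixel log-variance alongside the restored image (aleatoric) and to estimate epistemic uncertainty by Monte Carlo dropout at test time. The sketch below follows that generic recipe and is not the paper's UEB; layer sizes and the dropout rate are placeholders. A typical training objective for such a head weights the reconstruction error by exp(-logvar) and adds logvar as a regularizer.

```python
import torch
import torch.nn as nn

class UncertainHead(nn.Module):
    """Hypothetical sketch: one head predicts the dehazed image, another the
    per-pixel log-variance (aleatoric); epistemic uncertainty is estimated by
    keeping dropout active at test time and measuring the prediction spread."""
    def __init__(self, ch=32):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(3, ch, 3, padding=1), nn.ReLU(), nn.Dropout2d(0.2))
        self.mean = nn.Conv2d(ch, 3, 3, padding=1)
        self.logvar = nn.Conv2d(ch, 3, 3, padding=1)

    def forward(self, x):
        h = self.body(x)
        return self.mean(h), self.logvar(h)

def epistemic(model, x, samples=8):
    model.train()  # keep dropout on (MC dropout)
    preds = torch.stack([model(x)[0] for _ in range(samples)])
    return preds.var(dim=0)

net = UncertainHead()
mu, logvar = net(torch.randn(1, 3, 64, 64))
print(mu.shape, epistemic(net, torch.randn(1, 3, 64, 64)).shape)
```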

AAAI Conference 2021 Conference Paper

Boundary-Aware Geometric Encoding for Semantic Segmentation of Point Clouds

  • Jingyu Gong
  • Jiachen Xu
  • Xin Tan
  • Jie Zhou
  • Yanyun Qu
  • Yuan Xie
  • Lizhuang Ma

Boundary information plays a significant role in 2D image segmentation, but it is usually ignored in 3D point cloud segmentation, where ambiguous features may be generated during feature extraction, leading to misclassification in the transition area between two objects. In this paper, we first propose a Boundary Prediction Module (BPM) to predict boundary points. Based on the predicted boundary, a boundary-aware Geometric Encoding Module (GEM) is designed to encode geometric information and aggregate features with discrimination within a neighborhood, so that local features belonging to different categories will not pollute each other. To provide extra geometric information for the boundary-aware GEM, we also propose a lightweight Geometric Convolution Operation (GCO), making the extracted features more distinguishing. Built upon the boundary-aware GEM, we construct our network and test it on benchmarks such as ScanNet v2 and S3DIS. Results show that our method can significantly improve the baseline and achieve state-of-the-art performance.

IJCAI Conference 2021 Conference Paper

Learn from Concepts: Towards the Purified Memory for Few-shot Learning

  • Xuncheng Liu
  • Xudong Tian
  • Shaohui Lin
  • Yanyun Qu
  • Lizhuang Ma
  • Wang Yuan
  • Zhizhong Zhang
  • Yuan Xie

Human beings have a great generalization ability and can recognize a novel category after seeing only a small number of samples. This is because humans can learn from the concepts that already exist in their minds. However, many existing few-shot approaches fail to address this fundamental problem, i.e., how to utilize the knowledge learned in the past to improve the prediction for a new task. In this paper, we present a novel purified memory mechanism that simulates the recognition process of human beings. This new memory updating scheme enables the model to purify the information from semantic labels and progressively learn consistent, stable, and expressive concepts as episodes are trained one by one. On this basis, a Graph Augmentation Module (GAM) is introduced to aggregate these concepts and the knowledge learned from new tasks via a graph neural network, making the prediction more accurate. Generally, our approach is model-agnostic and computationally efficient with negligible memory cost. Extensive experiments performed on several benchmarks demonstrate that the proposed method consistently outperforms a vast number of state-of-the-art few-shot learning methods.

IJCAI Conference 2021 Conference Paper

Self-boosting for Feature Distillation

  • Yulong Pei
  • Yanyun Qu
  • Junping Zhang

Knowledge distillation is a simple but effective method for model compression, which obtains a better-performing small network (Student) by learning from a well-trained large network (Teacher). However, when the difference in the model sizes of Student and Teacher is large, the gap in capacity leads to poor performance of Student. Existing methods focus on seeking simplified or more effective knowledge from Teacher to narrow the Teacher-Student gap, while we address this problem by Student's self-boosting. Specifically, we propose a novel distillation method named Self-boosting Feature Distillation (SFD), which eases the Teacher-Student gap by feature integration and self-distillation of Student. Three different modules are designed for feature integration to enhance the discriminability of Student's feature, which leads to improving the order of convergence in theory. Moreover, an easy-to-operate self-distillation strategy is put forward to stabilize the training process and promote the performance of Student, without additional forward propagation or memory consumption. Extensive experiments on multiple benchmarks and networks show that our method is significantly superior to existing methods.

IJCAI Conference 2021 Conference Paper

Towards Compact Single Image Super-Resolution via Contrastive Self-distillation

  • Yanbo Wang
  • Shaohui Lin
  • Yanyun Qu
  • Haiyan Wu
  • Zhizhong Zhang
  • Yuan Xie
  • Angela Yao

Convolutional neural networks (CNNs) are highly successful for super-resolution (SR) but often require sophisticated architectures whose heavy memory cost and computational overhead significantly restrict their practical deployment on resource-limited devices. In this paper, we propose a novel contrastive self-distillation (CSD) framework to simultaneously compress and accelerate various off-the-shelf SR models. In particular, a channel-splitting super-resolution network is first constructed from a target teacher network as a compact student network. Then, we propose a novel contrastive loss to improve the quality of SR images and PSNR/SSIM via explicit knowledge transfer. Extensive experiments demonstrate that the proposed CSD scheme effectively compresses and accelerates several standard SR models such as EDSR, RCAN and CARN. Code is available at https://github.com/Booooooooooo/CSD.
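
The contrastive term described above can be pictured as a ratio loss in a frozen feature space: pull the student's output toward the teacher's output (positive) while pushing it away from the bicubic-upscaled input (negative). The snippet is a hypothetical sketch of that loss shape; the toy average-pooling feature extractor stands in for a pretrained VGG, and the channel-splitting student construction is omitted.

```python
import torch
import torch.nn.functional as F

def contrastive_sr_loss(student_sr, teacher_sr, lr_upscaled, feat_fn, eps=1e-7):
    """Hypothetical sketch: in a frozen feature space, the student output is
    pulled toward the teacher output (positive) and pushed away from the
    bicubic-upscaled LR input (negative)."""
    fs, ft, fn = feat_fn(student_sr), feat_fn(teacher_sr), feat_fn(lr_upscaled)
    pos = F.l1_loss(fs, ft)   # distance to the positive (teacher)
    neg = F.l1_loss(fs, fn)   # distance to the negative (upscaled LR)
    return pos / (neg + eps)

feat_fn = lambda x: F.avg_pool2d(x, 4)  # toy stand-in for a pretrained feature extractor
s, t, b = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
print(contrastive_sr_loss(s, t, b, feat_fn).item())
```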

AAAI Conference 2021 Conference Paper

Weakly Supervised Semantic Segmentation for Large-Scale Point Cloud

  • Yachao Zhang
  • Zonghao Li
  • Yuan Xie
  • Yanyun Qu
  • Cuihua Li
  • Tao Mei

Existing methods for large-scale point cloud semantic segmentation require expensive, tedious and error-prone manual point-wise annotations. Intuitively, weakly supervised training is a direct solution to reduce the cost of labeling. However, for weakly supervised large-scale point cloud semantic segmentation, too few annotations will inevitably lead to ineffective learning of the network. We propose an effective weakly supervised method containing two components to solve this problem. First, we construct a pretext task, i.e., point cloud colorization, with self-supervised learning to transfer the prior knowledge learned from a large amount of unlabeled point clouds to a weakly supervised network. In this way, the representation capability of the weakly supervised network can be improved by the guidance from a heterogeneous task. Besides, to generate pseudo labels for unlabeled data, a sparse label propagation mechanism is proposed with the help of generated class prototypes, which are used to measure the classification confidence of unlabeled points. Our method is evaluated on large-scale point cloud datasets with different scenarios, including indoor and outdoor. The experimental results show a large gain against existing weakly supervised methods and comparable results to fully supervised methods.
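
The sparse label propagation step can be sketched as prototype-based pseudo-labelling: average the features of the few labeled points per class, assign every point to its most similar prototype, and keep only confident assignments. The function below is an illustrative version of that idea; the cosine-similarity scoring and the confidence threshold are assumptions rather than the paper's exact mechanism.

```python
import torch
import torch.nn.functional as F

def propagate_labels(feats, labeled_idx, labels, num_classes, conf_thresh=0.5):
    """Hypothetical sketch: build class prototypes from the few labeled points,
    assign each point to its nearest prototype (cosine similarity), and keep
    only confident assignments as pseudo labels (-1 marks unlabeled points)."""
    protos = torch.stack([feats[labeled_idx][labels == c].mean(0) for c in range(num_classes)])
    sim = F.normalize(feats, dim=1) @ F.normalize(protos, dim=1).T   # (N, C)
    conf, pseudo = sim.softmax(dim=1).max(dim=1)
    pseudo[conf < conf_thresh] = -1       # drop low-confidence assignments
    pseudo[labeled_idx] = labels          # ground-truth labels always win
    return pseudo

feats = torch.randn(200, 32)
labeled_idx = torch.arange(0, 200, 20)    # 5% of points carry labels
labels = torch.arange(10) % 4             # every class has at least one labeled point
print(propagate_labels(feats, labeled_idx, labels, num_classes=4).shape)
```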

IJCAI Conference 2020 Conference Paper

Meta Segmentation Network for Ultra-Resolution Medical Images

  • Tong Wu
  • Bicheng Dai
  • Shuxin Chen
  • Yanyun Qu
  • Yuan Xie

Despite recent great progress on semantic segmentation, there still exist huge challenges in medical ultra-resolution image segmentation. Methods based on a multi-branch structure can strike a good balance between computational burden and segmentation accuracy. However, the fusion structure in these methods needs to be designed elaborately to achieve desirable results, which leads to model redundancy. In this paper, we propose the Meta Segmentation Network (MSN) to solve this challenging problem. With the help of meta-learning, the fusion module of MSN is quite simple but effective. MSN can quickly generate the weights of fusion layers through a simple meta-learner, requiring only a few training samples and epochs to converge. In addition, to avoid learning all branches from scratch, we further introduce a particular weight-sharing mechanism to realize fast knowledge adaptation and share the weights among multiple branches, resulting in performance improvement and significant parameter reduction. The experimental results on two challenging ultra-resolution medical datasets, BACH and ISIC, show that MSN achieves the best performance compared with state-of-the-art approaches.

AAAI Conference 2020 Conference Paper

Patch Proposal Network for Fast Semantic Segmentation of High-Resolution Images

  • Tong Wu
  • Zhenzhen Lei
  • Bingqian Lin
  • Cuihua Li
  • Yanyun Qu
  • Yuan Xie

Despite recent progress on the segmentation of high-resolution images, there exists an unsolved problem, i.e., the trade-off among segmentation accuracy, memory resources and inference speed. So far, GLNet has been introduced for high- or ultra-resolution image segmentation, which reduces the computational memory of the segmentation network. However, it ignores the importance of different cropped patches and treats tiled patches equally for fusion with the whole image, resulting in high computational cost. To solve this problem, we introduce a patch proposal network (PPN) in this paper, which adaptively distinguishes the critical patches from the trivial ones to fuse with the whole image for refining segmentation. PPN is a classification network that alleviates the network training burden and improves segmentation accuracy. We further embed PPN in a global-local segmentation network, instructing the global branch and the refinement branch to work collaboratively. We implement our method on four image datasets: DeepGlobe, ISIC, CRAG and Cityscapes; the first two are ultra-resolution image datasets and the last two are high-resolution image datasets. The experimental results show that our method achieves almost the best segmentation performance compared with state-of-the-art segmentation methods, and the inference speed is 12.9 fps on DeepGlobe and 10 fps on ISIC. Moreover, we embed PPN in a general semantic segmentation network, and the experimental results on Cityscapes, which contains more object classes, demonstrate the generalization ability to general semantic segmentation.