Author name cluster

Yanning Zhang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

30 papers

1 author row

EAAI Journal 2026 Journal Article

Alternating exposure control network for real-world environments

Chenyuan Zhao
Yu Zhu
Qingsen Yan
Jinqiu Sun
Yanning Zhang

JBHI Journal 2026 Journal Article

From Few to More: Scribble-Based Medical Image Segmentation via Masked Context Modeling and Continuous Pseudo Labels

Zhisong Wang
Yiwen Ye
Ziyang Chen
Minglei Shu
Yanning Zhang
Yong Xia

Scribble-based weakly supervised segmentation methods have shown promising results in medical image segmentation, significantly reducing annotation costs. However, existing approaches often rely on auxiliary tasks to enforce semantic consistency and use hard pseudo labels for supervision, overlooking the unique challenges faced by models trained with sparse annotations. These models must predict pixel-wise segmentation maps from limited data, making it crucial to handle varying levels of annotation richness effectively. In this paper, we propose MaCo, a weakly supervised model designed for medical image segmentation, based on the principle of “from few to more.” MaCo leverages Masked Context Modeling (MCM) and Continuous Pseudo Labels (CPL). MCM employs an attention-based masking strategy to perturb the input image, ensuring that the model’s predictions align with those of the original image. CPL converts scribble annotations into continuous pixel-wise labels by applying an exponential decay function to distance maps, producing confidence maps that represent the likelihood of each pixel belonging to a specific category, rather than relying on hard pseudo labels. We evaluate MaCo on three public datasets, comparing it with other weakly supervised methods. Our results show that MaCo outperforms competing methods across all datasets, establishing a new record in weakly supervised medical image segmentation.
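
The CPL step in the abstract above (an exponential decay applied to distance maps) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `continuous_pseudo_labels`, the decay scale `tau`, the use of Euclidean distance, and the brute-force nearest-scribble search are all assumptions made for illustration.

```python
import math


def continuous_pseudo_labels(scribbles, num_classes, tau=2.0):
    """Turn sparse scribbles into per-class confidence maps.

    scribbles: 2D list holding a class index on scribbled pixels, -1 elsewhere.
    tau: hypothetical decay scale (not taken from the paper).
    Returns conf[c][y][x] = exp(-d / tau), where d is the Euclidean distance
    from pixel (y, x) to the nearest scribble pixel of class c.
    """
    h, w = len(scribbles), len(scribbles[0])
    conf = [[[0.0] * w for _ in range(h)] for _ in range(num_classes)]
    for c in range(num_classes):
        seeds = [(y, x) for y in range(h) for x in range(w)
                 if scribbles[y][x] == c]
        if not seeds:
            continue  # no scribble for this class: leave confidence at 0
        for y in range(h):
            for x in range(w):
                d = min(math.hypot(y - sy, x - sx) for sy, sx in seeds)
                conf[c][y][x] = math.exp(-d / tau)
    return conf


scribbles = [
    [0, -1, -1],
    [-1, -1, -1],
    [-1, -1, 1],
]
conf = continuous_pseudo_labels(scribbles, num_classes=2)
```

Pixels on a scribble get confidence 1 for their class, and the confidence falls off smoothly with distance, which is what distinguishes these labels from hard pseudo labels.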

EAAI Journal 2026 Journal Article

Progressive boundary optimisation with cross-knowledge enhancement for arbitrary-shape text detection

Wei Sun
Yaqi Wang
Qianzhou Wang
Xianguang Kong
Yanning Zhang

AAAI Conference 2026 Conference Paper

SOMA: Feature Gradient Enhanced Affine-Flow Matching for SAR-Optical Registration

Haodong Wang
Tao Zhuo
Xiuwei Zhang
Hanlin Yin
Wencong Wu
Yanning Zhang

Achieving pixel-level registration between SAR and optical images remains a challenging task due to their fundamentally different imaging mechanisms and visual characteristics. Although deep learning has achieved great success in many cross-modal tasks, its performance on SAR-Optical registration tasks is still unsatisfactory. Gradient-based information has traditionally played a crucial role in handcrafted descriptors by highlighting structural differences. However, such gradient cues have not been effectively leveraged in deep learning frameworks for SAR-Optical image matching. To address this gap, we propose SOMA, a dense registration framework that integrates structural gradient priors into deep features and refines alignment through a hybrid matching strategy. Specifically, we introduce the Feature Gradient Enhancer (FGE), which embeds multi-scale, multi-directional gradient filters into the feature space using attention and reconstruction mechanisms to boost feature distinctiveness. Furthermore, we propose the Global-Local Affine-Flow Matcher (GLAM), which combines affine transformation and flow-based refinement within a coarse-to-fine architecture to ensure both structural consistency and local accuracy. Experimental results demonstrate that SOMA significantly improves registration precision, increasing the CMR@1px by 12.29% on the SEN1-2 dataset and 18.50% on the GFGE_SO dataset. In addition, SOMA exhibits strong robustness and generalizes well across diverse scenes and resolutions.

AAAI Conference 2026 Conference Paper

YOLO-IOD: Towards Real Time Incremental Object Detection

Shizhou Zhang
Xueqiang Lv
Yinghui Xing
Qirui Wu
Di Xu
Chen Zhao
Yanning Zhang

Current methodologies for incremental object detection (IOD) primarily rely on Faster R-CNN or DETR series detectors; however, these approaches do not accommodate the real-time YOLO detection frameworks. In this paper, we first identify three primary types of knowledge conflicts that contribute to catastrophic forgetting in YOLO-based incremental detectors: foreground-background confusion, parameter interference, and misaligned knowledge distillation. Subsequently, we introduce YOLO-IOD, a real-time Incremental Object Detection (IOD) framework that is constructed upon the pretrained YOLO-World model, facilitating incremental learning via a stage-wise parameter-efficient finetuning process. Specifically, YOLO-IOD encompasses three principal components: 1) Conflict-Aware Pseudo-Label Refinement (CPR), which mitigates the foreground-background confusion by leveraging the confidence levels of pseudo labels and identifying potential objects relevant to future tasks. 2) Importance-based Kernel Selection (IKS), which identifies and updates the pivotal convolution kernels pertinent to the current task during the current learning stage. 3) Cross-Stage Asymmetric Knowledge Distillation (CAKD), which addresses the misaligned knowledge distillation conflict by transmitting the features of the student target detector through the detection heads of both the previous and current teacher detectors, thereby facilitating asymmetric distillation between existing and newly introduced categories. We further introduce LoCo COCO, a more realistic benchmark that eliminates data leakage across stages. Experiments on both conventional and LoCo COCO benchmarks show that YOLO-IOD achieves superior performance with minimal forgetting.

AAAI Conference 2025 Conference Paper

Building a Multi-modal Spatiotemporal Expert for Zero-shot Action Recognition with CLIP

Yating Yu
Congqi Cao
Yueran Zhang
Qinyi Lv
Lingtong Min
Yanning Zhang

Zero-shot action recognition (ZSAR) requires collaborative multi-modal spatiotemporal understanding. However, finetuning CLIP directly for ZSAR yields suboptimal performance, given its inherent constraints in capturing essential temporal dynamics from both vision and text perspectives, especially when encountering novel actions with fine-grained spatiotemporal discrepancies. In this work, we propose Spatiotemporal Dynamic Duo (STDD), a novel CLIP-based framework to comprehend multi-modal spatiotemporal dynamics synergistically. For the vision side, we propose an efficient Space-time Cross Attention, which captures spatiotemporal dynamics flexibly with simple yet effective operations applied before and after spatial attention, without adding additional parameters or increasing computational complexity. For the semantic side, we conduct spatiotemporal text augmentation by comprehensively constructing an Action Semantic Knowledge Graph (ASKG) to derive nuanced text prompts. The ASKG elaborates on static and dynamic concepts and their interrelations, based on the idea of decomposing actions into spatial appearances and temporal motions. During the training phase, the frame-level video representations are meticulously aligned with prompt-level nuanced text representations, which are concurrently regulated by the video representations from the frozen CLIP to enhance generalizability. Extensive experiments validate the effectiveness of our approach, which consistently surpasses state-of-the-art approaches on popular video benchmarks (i.e., Kinetics-600, UCF101, and HMDB51) under challenging ZSAR settings.

EAAI Journal 2025 Journal Article

Matching quality-guided model-free satellite pose estimation

Zhaoshuai Qi
Yating Liu
Yanning Zhang

JBHI Journal 2025 Journal Article

P2TC: A Lightweight Pyramid Pooling Transformer-CNN Network for Accurate 3D Whole Heart Segmentation

Hengfei Cui
Yifan Wang
Fan Zheng
Yan Li
Yanning Zhang
Yong Xia

Cardiovascular disease is a leading global cause of death, requiring accurate heart segmentation for diagnosis and surgical planning. Deep learning methods have been demonstrated to achieve superior performances in cardiac structures segmentation. However, there are still limitations in 3D whole heart segmentation, such as inadequate spatial context modeling, difficulty in capturing long-distance dependencies, high computational complexity, and limited representation of local high-level semantic information. To tackle the above problems, we propose a lightweight Pyramid Pooling Transformer-CNN (P2TC) network for accurate 3D whole heart segmentation. The proposed architecture comprises a dual encoder-decoder structure with a 3D pyramid pooling Transformer for multi-scale information fusion and a lightweight large-kernel Convolutional Neural Network (CNN) for local feature extraction. The decoder has two branches for precise segmentation and contextual residual handling. The first branch is used to generate segmentation masks for pixel-level classification based on the features extracted by the encoder to achieve accurate segmentation of cardiac structures. The second branch highlights contextual residuals across slices, enabling the network to better handle variations and boundaries. Extensive experimental results on the Multi-Modality Whole Heart Segmentation (MM-WHS) 2017 challenge dataset demonstrate that P2TC outperforms the most advanced methods, achieving the Dice scores of 92.6% and 88.1% in Computed Tomography (CT) and Magnetic Resonance Imaging (MRI) modalities respectively, which surpasses the baseline model by 1.5% and 1.7%, and achieves state-of-the-art segmentation results.

NeurIPS Conference 2025 Conference Paper

PoseCrafter: Extreme Pose Estimation with Hybrid Video Synthesis

Qing Mao
Tianxin Huang
Yu Zhu
Jinqiu Sun
Yanning Zhang
Gim Hee Lee

Pairwise camera pose estimation from sparsely overlapping image pairs remains a critical and unsolved challenge in 3D vision. Most existing methods struggle with image pairs that have small or no overlap. Recent approaches attempt to address this by synthesizing intermediate frames using video interpolation and selecting key frames via a self-consistency score. However, the generated frames are often blurry due to small overlap inputs, and the selection strategies are slow and not explicitly aligned with pose estimation. To solve these cases, we propose Hybrid Video Generation (HVG) to synthesize clearer intermediate frames by coupling a video interpolation model with a pose-conditioned novel view synthesis model, where we also propose a Feature Matching Selector (FMS) based on feature correspondence to select intermediate frames appropriate for pose estimation from the synthesized results. Extensive experiments on Cambridge Landmarks, ScanNet, DL3DV-10K, and NAVI demonstrate that, compared to existing SOTA methods, PoseCrafter can obviously enhance the pose estimation performances, especially on examples with small or no overlap.

IJCAI Conference 2025 Conference Paper

Prompt-Free Conditional Diffusion for Multi-object Image Augmentation

Haoyu Wang
Lei Zhang
Wei Wei
Chen Ding
Yanning Zhang

Diffusion models have underpinned many recent advances in dataset augmentation for various computer vision tasks. However, when generating multi-object images as in real scenarios, most existing methods either rely entirely on the text condition, resulting in a deviation between the generated objects and the original data, or rely too much on the original images, resulting in a lack of diversity in the generated images, which is of limited help to downstream tasks. To mitigate both problems at once, we propose a prompt-free conditional diffusion framework for multi-object image augmentation. Specifically, we introduce a local-global semantic fusion strategy to extract semantics from images to replace text, and inject knowledge into the diffusion model through LoRA to alleviate the category deviation between the original model and the target dataset. In addition, we design a reward-model-based counting loss to assist the traditional reconstruction loss for model training. By constraining the object counts of each category instead of imposing pixel-by-pixel constraints, it bridges the quantity deviation between the generated data and the original data while improving the diversity of the generated data. Experimental results demonstrate the superiority of the proposed method over several representative state-of-the-art baselines and showcase strong downstream task gain and out-of-domain generalization capabilities. Code is available at https://github.com/00why00/PFCD.

AAAI Conference 2025 Conference Paper

Training Consistent Mixture-of-Experts-Based Prompt Generator for Continual Learning

Yue Lu
Shizhou Zhang
De Cheng
Guoqiang Liang
Yinghui Xing
Nannan Wang
Yanning Zhang

Visual prompt tuning-based continual learning (CL) methods have shown promising performance in exemplar-free scenarios, where their key component can be viewed as a prompt generator. Existing approaches generally rely on freezing old prompts, slow updating and task discrimination for prompt generators to preserve stability and minimize forgetting. In contrast, we introduce a novel approach that trains a consistent prompt generator to ensure stability during CL. Consistency means that for any instance from an old task, its corresponding instance-aware prompt generated by the prompt generator remains consistent even as the generator continually updates in a new task. This ensures that the representation of a specific instance remains stable across tasks and thereby prevents forgetting. We employ a mixture of experts (MoE) as the prompt generator, which contains a router and multiple experts. By deriving conditions sufficient to achieve the consistency for the MoE prompt generator, we demonstrate that: during training in a new task, if the router and experts update in the directions orthogonal to the subspaces spanned by old input features and gating vectors, respectively, the consistency can be theoretically guaranteed. To implement this orthogonality, we project parameter gradients to those orthogonal directions using the orthogonal projection matrices computed via the null space method. Extensive experiments on four class-incremental learning benchmarks validate the effectiveness and superiority of our approach.
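
The core operation described above, projecting a gradient onto directions orthogonal to the subspace spanned by old features, can be sketched as follows. This is an illustration only, not the paper's method: it builds the orthogonal complement with modified Gram-Schmidt rather than the null space method the abstract mentions, it treats the gradient as a single vector, and the names `orthonormal_basis` and `project_to_orthogonal_complement` are hypothetical.

```python
def orthonormal_basis(vectors, tol=1e-10):
    """Modified Gram-Schmidt: orthonormal basis of span(vectors)."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coef = sum(wi * bi for wi, bi in zip(w, b))
            w = [wi - coef * bi for wi, bi in zip(w, b)]
        norm = sum(wi * wi for wi in w) ** 0.5
        if norm > tol:  # skip vectors already inside the current span
            basis.append([wi / norm for wi in w])
    return basis


def project_to_orthogonal_complement(grad, old_features):
    """Remove from grad every component lying in span(old_features)."""
    g = list(grad)
    for b in orthonormal_basis(old_features):
        coef = sum(gi * bi for gi, bi in zip(g, b))
        g = [gi - coef * bi for gi, bi in zip(g, b)]
    return g


old_features = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]  # toy "old task" features
grad = [3.0, 4.0, 5.0]
projected = project_to_orthogonal_complement(grad, old_features)
```

After projection, the update has zero inner product with every old feature, so a linear map updated along it would leave its outputs on those features unchanged. That is the intuition behind the consistency condition, under the simplifying assumptions stated above.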

AAAI Conference 2025 Conference Paper

VarCMP: Adapting Cross-Modal Pre-Training Models for Video Anomaly Retrieval

Peng Wu
Wanshun Su
Xiangteng He
Peng Wang
Yanning Zhang

Video anomaly retrieval (VAR) aims to retrieve pertinent abnormal or normal videos from collections of untrimmed and long videos through cross-modal queries such as textual descriptions and synchronized audios. Cross-modal pre-training (CMP) models, by pre-training on large-scale cross-modal pairs, e.g., image and text, can learn the rich associations between different modalities, and this cross-modal association capability gives CMP an advantage in conventional retrieval tasks. Inspired by this, how to utilize the robust cross-modal association capabilities of CMP in VAR to search for crucial visual components in these untrimmed and long videos becomes a critical research problem. Therefore, this paper proposes a VAR method based on CMP models, named VarCMP. First, a unified hierarchical alignment strategy is proposed to constrain the semantic and spatial consistency between video and text, as well as the semantic, temporal, and spatial consistency between video and audio. It fully leverages the efficient cross-modal association capabilities of CMP models by considering cross-modal similarities at multiple granularities, enabling VarCMP to achieve effective all-round information matching for both video-text and video-audio VAR tasks. Moreover, to further solve the problem of untrimmed and long video alignment, an anomaly-biased weighting is devised in the fine-grained alignment, which identifies key segments in untrimmed long videos using anomaly priors, giving them more attention, thereby discarding irrelevant segment information, and achieving more accurate matching with cross-modal queries. Extensive experiments demonstrate the high efficacy of VarCMP in both video-text and video-audio VAR tasks, achieving significant improvements on both text-video (UCFCrime-AR) and audio-video (XDViolence-AR) datasets against the best competitors by 5.0% and 5.3% R@1.

IJCAI Conference 2024 Conference Paper

C3L: Content Correlated Vision-Language Instruction Tuning Data Generation via Contrastive Learning

Ji Ma
Wei Suo
Peng Wang
Yanning Zhang

Vision-Language Instruction Tuning (VLIT) is a critical training phase for Large Vision-Language Models (LVLMs). With the improving capabilities of open-source LVLMs, researchers have increasingly turned to generating VLIT data by using open-source LVLMs and achieved significant progress. However, such data generation approaches are bottlenecked by the following challenges: 1) Since multi-modal models tend to be influenced by prior language knowledge, directly using LVLMs to generate VLIT data would inevitably lead to low content relevance between generated data and images. 2) To improve the ability of the models to generate VLIT data, previous methods have incorporated an additional training phase to boost the generative capacity. This process hurts the generalization of the models to unseen inputs (i.e., the “exposure bias” problem). In this paper, we propose a new Content Correlated VLIT data generation via Contrastive Learning (C3L). Specifically, we design a new content relevance module which enhances the content relevance between VLIT data and images by computing Image Instruction Correspondence Scores S(I2C). Moreover, a contrastive learning module is introduced to further boost the VLIT data generation capability of the LVLMs. A large number of automatic measures on four benchmarks show the effectiveness of our method.

AAAI Conference 2024 Conference Paper

GSDD: Generative Space Dataset Distillation for Image Super-resolution

Haiyu Zhang
Shaolin Su
Yu Zhu
Jinqiu Sun
Yanning Zhang

Single image super-resolution (SISR), especially in the real world, usually builds a large amount of LR-HR image pairs to learn representations that contain rich textural and structural information. However, relying on massive data for model training not only reduces training efficiency, but also causes heavy data storage burdens. In this paper, we attempt a pioneering study on dataset distillation (DD) for SISR problems to explore how data could be slimmed and compressed for the task. Unlike previous coreset selection methods which select a few typical examples directly from the original data, we remove the limitation that the selected data cannot be further edited, and propose to synthesize and optimize samples to preserve more task-useful representations. Concretely, by utilizing pre-trained GANs as a suitable approximation of realistic data distribution, we propose GSDD, which distills data in a latent generative space based on GAN-inversion techniques. By optimizing them to match with the practical data distribution in an informative feature space, the distilled data could then be synthesized. Experimental results demonstrate that when trained with our distilled data, GSDD can achieve comparable performance to the state-of-the-art (SOTA) SISR algorithms, while a nearly ×8 increase in training efficiency and a saving of almost 93.2% data storage space can be realized. Further experiments on challenging real-world data also demonstrate the promising generalization ability of GSDD.

EAAI Journal 2024 Journal Article

Image deraining via invertible disentangled representations

Xueling Chen
Xuan Zhou
Wei Sun
Yanning Zhang

NeurIPS Conference 2024 Conference Paper

Meta-Exploiting Frequency Prior for Cross-Domain Few-Shot Learning

Fei Zhou
Peng Wang
Lei Zhang
Zhenghua Chen
Wei Wei
Chen Ding
Guosheng Lin
Yanning Zhang

Meta-learning offers a promising avenue for few-shot learning (FSL), enabling models to glean a generalizable feature embedding through episodic training on synthetic FSL tasks in a source domain. Yet, in practical scenarios where the target task diverges from that in the source domain, meta-learning based methods are susceptible to over-fitting. To overcome this, we introduce a novel framework, Meta-Exploiting Frequency Prior for Cross-Domain Few-Shot Learning, which is crafted to comprehensively exploit the cross-domain transferable image prior that each image can be decomposed into complementary low-frequency content details and high-frequency robust structural characteristics. Motivated by this insight, we propose to decompose each query image into its high-frequency and low-frequency components, and incorporate them in parallel into the feature embedding network to enhance the final category prediction. More importantly, we introduce a feature reconstruction prior and a prediction consistency prior to separately encourage the consistency of the intermediate feature as well as the final category prediction between the original query image and its decomposed frequency components. This allows for collectively guiding the network's meta-learning process with the aim of learning generalizable image feature embeddings, while not introducing any extra computational cost in the inference phase. Our framework establishes new state-of-the-art results on multiple cross-domain few-shot learning benchmarks.
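
To make the decomposition idea above concrete, the sketch below splits a grayscale image into a low-frequency part and a high-frequency residual. The abstract does not say which decomposition the authors use, so a simple box-filter low-pass stands in here as an assumption. The names `box_blur` and `decompose` and the `radius` parameter are hypothetical.

```python
def box_blur(image, radius=1):
    """Low-pass filter: mean over a (2*radius+1)^2 window, clipped at borders."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [image[j][i]
                    for j in range(max(0, y - radius), min(h, y + radius + 1))
                    for i in range(max(0, x - radius), min(w, x + radius + 1))]
            out[y][x] = sum(vals) / len(vals)
    return out


def decompose(image, radius=1):
    """Return (low, high) with low + high == image element-wise."""
    low = box_blur(image, radius)
    high = [[image[y][x] - low[y][x] for x in range(len(image[0]))]
            for y in range(len(image))]
    return low, high


image = [
    [0.0, 0.0, 0.0],
    [0.0, 9.0, 0.0],
    [0.0, 0.0, 0.0],
]
low, high = decompose(image)
```

The two components are complementary by construction: the smooth part carries coarse content, and the residual carries edges and fine structure.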

AAAI Conference 2024 Conference Paper

VadCLIP: Adapting Vision-Language Models for Weakly Supervised Video Anomaly Detection

Peng Wu
Xuerong Zhou
Guansong Pang
Lingru Zhou
Qingsen Yan
Peng Wang
Yanning Zhang

The recent contrastive language-image pre-training (CLIP) model has shown great success in a wide range of image-level tasks, revealing remarkable ability for learning powerful visual representations with rich semantics. An open and worthwhile problem is efficiently adapting such a strong model to the video domain and designing a robust video anomaly detector. In this work, we propose VadCLIP, a new paradigm for weakly supervised video anomaly detection (WSVAD) by leveraging the frozen CLIP model directly without any pre-training and fine-tuning process. Unlike current works that directly feed extracted features into the weakly supervised classifier for frame-level binary classification, VadCLIP makes full use of fine-grained associations between vision and language on the strength of CLIP and involves dual branch. One branch simply utilizes visual features for coarse-grained binary classification, while the other fully leverages the fine-grained language-image alignment. With the benefit of dual branch, VadCLIP achieves both coarse-grained and fine-grained video anomaly detection by transferring pre-trained knowledge from CLIP to WSVAD task. We conduct extensive experiments on two commonly-used benchmarks, demonstrating that VadCLIP achieves the best performance on both coarse-grained and fine-grained WSVAD, surpassing the state-of-the-art methods by a large margin. Specifically, VadCLIP achieves 84.51% AP and 88.02% AUC on XD-Violence and UCF-Crime, respectively. Code and features are released at https://github.com/nwpu-zxr/VadCLIP.

NeurIPS Conference 2024 Conference Paper

Visual Prompt Tuning in Null Space for Continual Learning

Yue Lu
Shizhou Zhang
De Cheng
Yinghui Xing
Nannan Wang
Peng Wang
Yanning Zhang

Existing prompt-tuning methods have demonstrated impressive performances in continual learning (CL), by selecting and updating relevant prompts in the vision-transformer models. On the contrary, this paper aims to learn each task by tuning the prompts in the direction orthogonal to the subspace spanned by previous tasks' features, so as to ensure no interference on tasks that have been learned to overcome catastrophic forgetting in CL. However, different from the orthogonal projection in the traditional CNN architecture, the prompt gradient orthogonal projection in the ViT architecture shows completely different and greater challenges, i.e., 1) the high-order and non-linear self-attention operation; 2) the drift of prompt distribution brought by the LayerNorm in the transformer block. Theoretically, we have finally deduced two consistency conditions to achieve the prompt gradient orthogonal projection, which provide a theoretical guarantee of eliminating interference on previously learned knowledge via the self-attention mechanism in visual prompt tuning. In practice, an effective null-space-based approximation solution has been proposed to implement the prompt gradient orthogonal projection. Extensive experimental results demonstrate the effectiveness of anti-forgetting on four class-incremental benchmarks with diverse pre-trained baseline models, and our approach achieves superior performances to state-of-the-art methods. Our code is available at https://github.com/zugexiaodui/VPTinNSforCL

JBHI Journal 2023 Journal Article

An Improved Combination of Faster R-CNN and U-Net Network for Accurate Multi-Modality Whole Heart Segmentation

Hengfei Cui
Yifan Wang
Yan Li
Di Xu
Lei Jiang
Yong Xia
Yanning Zhang

Detailed information of substructures of the whole heart is usually vital in the diagnosis of cardiovascular diseases and in 3D modeling of the heart. Deep convolutional neural networks have been demonstrated to achieve state-of-the-art performance in 3D cardiac structures segmentation. However, when dealing with high-resolution 3D data, current methods employing tiling strategies usually degrade segmentation performances due to GPU memory constraints. This work develops a two-stage multi-modality whole heart segmentation strategy, which adopts an improved Combination of Faster R-CNN and 3D U-Net (CFUN+). More specifically, the bounding box of the heart is first detected by Faster R-CNN, and then the original Computed Tomography (CT) and Magnetic Resonance Imaging (MRI) images of the heart aligned with the bounding box are input into 3D U-Net for segmentation. The proposed CFUN+ method redefines the bounding box loss function by replacing the previous Intersection over Union (IoU) loss with Complete Intersection over Union (CIoU) loss. Meanwhile, the integration of the edge loss makes the segmentation results more accurate, and also improves the convergence speed. The proposed method achieves an average Dice score of 91.1% on the Multi-Modality Whole Heart Segmentation (MM-WHS) 2017 challenge CT dataset, which is 5.2% higher than the baseline CFUN model, and achieves state-of-the-art segmentation results. In addition, the segmentation speed of a single heart has been dramatically improved from a few minutes to less than 6 seconds.
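
For readers unfamiliar with the CIoU loss mentioned above, here is a minimal sketch of its standard 2D form for axis-aligned boxes: one minus IoU, plus a normalized centre-distance penalty, plus an aspect-ratio consistency term. It illustrates the general loss only. How CFUN+ applies it to its own bounding boxes is not described in the abstract, and the function name `ciou_loss` and the `eps` stabilizer are assumptions.

```python
import math


def ciou_loss(box_a, box_b, eps=1e-9):
    """CIoU loss for boxes given as (x1, y1, x2, y2) with x1 < x2, y1 < y2."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Intersection over union.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / (union + eps)

    # Squared centre distance over squared diagonal of the enclosing box.
    rho2 = ((ax1 + ax2) / 2 - (bx1 + bx2) / 2) ** 2 \
         + ((ay1 + ay2) / 2 - (by1 + by2) / 2) ** 2
    c2 = (max(ax2, bx2) - min(ax1, bx1)) ** 2 \
       + (max(ay2, by2) - min(ay1, by1)) ** 2 + eps

    # Aspect-ratio consistency term and its trade-off weight.
    v = (4 / math.pi ** 2) * (math.atan((bx2 - bx1) / (by2 - by1))
                              - math.atan((ax2 - ax1) / (ay2 - ay1))) ** 2
    alpha = v / (1 - iou + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v


same = ciou_loss((0, 0, 2, 2), (0, 0, 2, 2))
overlap = ciou_loss((0, 0, 2, 2), (1, 1, 3, 3))
disjoint = ciou_loss((0, 0, 2, 2), (3, 3, 5, 5))
```

Unlike the plain IoU loss, the centre-distance term still gives a useful signal when the boxes do not overlap at all, which is the usual motivation for the switch.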

IJCAI Conference 2023 Conference Paper

Dichotomous Image Segmentation with Frequency Priors

Yan Zhou
Bo Dong
Yuanfeng Wu
Wentao Zhu
Geng Chen
Yanning Zhang

Dichotomous image segmentation (DIS) has a wide range of real-world applications and gained increasing research attention in recent years. In this paper, we propose to tackle DIS with informative frequency priors. Our model, called FP-DIS, stems from the fact that prior knowledge in the frequency domain can provide valuable cues to identify fine-grained object boundaries. Specifically, we propose a frequency prior generator to jointly utilize a fixed filter and learnable filters to extract informative frequency priors. Before embedding the frequency priors into the network, we first harmonize the multi-scale side-out features to reduce their heterogeneity. This is achieved by our feature harmonization module, which is based on a gating mechanism to harmonize the grouped features. Finally, we propose a frequency prior embedding module to embed the frequency priors into multi-scale features through an adaptive modulation strategy. Extensive experiments on the benchmark dataset, DIS5K, demonstrate that our FP-DIS outperforms state-of-the-art methods by a large margin in terms of key evaluation metrics.

AAAI Conference 2023 Conference Paper

Progressive Neighborhood Aggregation for Semantic Segmentation Refinement

Ting Liu
Yunchao Wei
Yanning Zhang

Multi-scale features from backbone networks have been widely applied to recover object details in segmentation tasks. Generally, the multi-level features are fused in a certain manner for further pixel-level dense prediction. Whereas, the spatial structure information is not fully explored, that is similar nearby pixels can be used to complement each other. In this paper, we investigate a progressive neighborhood aggregation (PNA) framework to refine the semantic segmentation prediction, resulting in an end-to-end solution that can perform the coarse prediction and refinement in a unified network. Specifically, we first present a neighborhood aggregation module, the neighborhood similarity matrices for each pixel are estimated on multi-scale features, which are further used to progressively aggregate the high-level feature for recovering the spatial structure. In addition, to further integrate the high-resolution details into the aggregated feature, we apply a self-aggregation module on the low-level features to emphasize important semantic information for complementing losing spatial details. Extensive experiments on five segmentation datasets, including Pascal VOC 2012, CityScapes, COCO-Stuff 10k, DeepGlobe, and Trans10k, demonstrate that the proposed framework can be cascaded into existing segmentation models providing consistent improvements. In particular, our method achieves new state-of-the-art performances on two challenging datasets, DeepGlobe and Trans10k. The code is available at https://github.com/liutinglt/PNA.

AAAI Conference 2023 Conference Paper

See How You Read? Multi-Reading Habits Fusion Reasoning for Multi-Modal Fake News Detection

Lianwei Wu
Pusheng Liu
Yanning Zhang

The existing approaches based on different neural networks automatically capture and fuse the multimodal semantics of news, which have achieved great success for fake news detection. However, they still suffer from the limitations of both shallow fusion of multimodal features and less attention to the inconsistency between different modalities. To overcome them, we propose multi-reading habits fusion reasoning networks (MRHFR) for multi-modal fake news detection. In MRHFR, inspired by people's different reading habits for multimodal news, we summarize three basic cognitive reading habits and put forward a cognition-aware fusion layer to learn the dependencies between multimodal features of news, so as to deepen their semantic-level integration. To explore the inconsistency of different modalities of news, we develop a coherence constraint reasoning layer from two perspectives, which first measures the semantic consistency between the comments and different modal features of the news, and then probes the semantic deviation caused by unimodal features to the multimodal news content through a constraint strategy. Experiments on two public datasets demonstrate that MRHFR not only achieves excellent performance but also provides a new paradigm for capturing inconsistencies between multi-modal news.

IJCAI Conference 2023 Conference Paper

SSML-QNet: Scale-Separative Metric Learning Quadruplet Network for Multi-modal Image Patch Matching

Xiuwei Zhang
Yi Sun
Yamin Han
Yanping Li
Hanlin Yin
Yinghui Xing
Yanning Zhang

Multi-modal image matching is very challenging due to the significant diversities in visual appearance of different modal images. Typically, the existing well-performed methods mainly focus on learning invariant and discriminative features for measuring the relation between multi-modal image pairs. However, these methods often take the features as a whole and largely overlook the fact that different scale features for a same image pair may have different similarity, which may lead to sub-optimal results only. In this work, we propose a Scale-Separative Metric Learning Quadruplet network (SSML-QNet) for multi-modal image patch matching. Specifically, SSML-QNet can extract both relevant and irrelevant features of imaging modality with the proposed quadruplet network architecture. Then, the proposed Scale-Separative Metric Learning module separately encodes the similarity of different scale features with the pyramid structure. And for each scale, cross-modal consistent features are extracted and measured by coordinate and channel-wise attention sequentially. This makes our network robust to appearance divergence caused by different imaging mechanism. Experiments on the benchmark dataset (VIS-NIR, VIS-LWIR, Optical-SAR, and Brown) have verified that the proposed SSML-QNet is able to outperform other state-of-the-art methods. Furthermore, the cross-dataset transferring experiments on these four datasets also have shown that the proposed method has powerful ability of cross-dataset transferring.

PDF Details DOI

AAAI Conference 2023 Conference Paper

Stop-Gradient Softmax Loss for Deep Metric Learning

Lu Yang
Peng Wang
Yanning Zhang

Deep metric learning aims to learn a feature space that models the similarity between images, and feature normalization is a critical step for boosting performance. However, directly optimizing the L2-normalized softmax loss causes the network to fail to converge. Therefore, some SOTA approaches append a scale layer after the inner product to relieve the convergence problem, but this introduces a new problem: it is difficult to learn the best scaling parameters. In this letter, we look into the characteristics of softmax-based approaches and propose a novel learning objective function, Stop-Gradient Softmax Loss (SGSL), to solve the convergence problem in softmax-based deep metric learning with L2-normalization. In addition, we found a useful trick named Remove the last BN-ReLU (RBR). It removes the last BN-ReLU in the backbone to reduce the learning burden of the model. Experimental results on four fine-grained image retrieval benchmarks show that our proposed approach outperforms most existing approaches, i.e., our approach achieves 75.9% on CUB-200-2011, 94.7% on CARS196 and 83.1% on SOP, which outperforms other approaches by at least 1.7%, 2.9% and 1.7% on Recall@1.

PDF Details DOI

NeurIPS Conference 2023 Conference Paper

Toward Re-Identifying Any Animal

Bingliang Jiao
Lingqiao Liu
Liying Gao
Ruiqi Wu
Guosheng Lin
Peng Wang
Yanning Zhang

The current state of re-identification (ReID) models poses limitations to their applicability in the open world, as they are primarily designed and trained for specific categories like person or vehicle. In light of the importance of ReID technology for tracking wildlife populations and migration patterns, we propose a new task called "Re-identify Any Animal in the Wild" (ReID-AW). This task aims to develop a ReID model capable of handling any unseen wildlife category it encounters. To address this challenge, we have created a comprehensive dataset called Wildlife-71, which includes ReID data from 71 different wildlife categories. This dataset is the first of its kind to encompass multiple object categories in the realm of ReID. Furthermore, we have developed a universal re-identification model named UniReID specifically for the ReID-AW task. To enhance the model's adaptability to the target category, we employ a dynamic prompting mechanism using category-specific visual prompts. These prompts are generated based on knowledge gained from a set of pre-selected images within the target category. Additionally, we leverage explicit semantic knowledge derived from the large-scale pre-trained language model, GPT-4. This allows UniReID to focus on regions that are particularly useful for distinguishing individuals within the target category. Extensive experiments have demonstrated the remarkable generalization capability of our UniReID model. It showcases promising performance in handling arbitrary wildlife categories, offering significant advancements in the field of ReID for wildlife conservation and research purposes.

PDF Details

JBHI Journal 2021 Journal Article

Attention-Guided Deep Neural Network With Multi-Scale Feature Fusion for Liver Vessel Segmentation

Qingsen Yan
Bo Wang
Wei Zhang
Chuan Luo
Wei Xu
Zhengqing Xu
Yanning Zhang
Qinfeng Shi

Liver vessel segmentation is fast becoming a key instrument in the diagnosis and surgical planning of liver diseases. In clinical practice, liver vessels are normally manually annotated by clinicians on each slice of CT images, which is extremely laborious. Several deep learning methods exist for liver vessel segmentation; however, improving segmentation performance remains a major challenge due to the large variations and complex structure of liver vessels. Previous methods mainly use the existing UNet architecture, but not all features of the encoder are useful for segmentation and some even cause interference. To overcome this problem, we propose a novel deep neural network for liver vessel segmentation, called LVSNet, which employs special designs to obtain the accurate structure of the liver vessel. Specifically, we design an Attention-Guided Concatenation (AGC) module to adaptively select the useful context features from low-level features guided by high-level features. The proposed AGC module focuses on capturing rich complementary information to obtain more details. In addition, we introduce an innovative multi-scale fusion block by constructing hierarchical residual-like connections within one single residual block, which is of great importance for effectively linking the local blood vessel fragments together. Furthermore, we construct a new dataset containing 40 thin-thickness cases (0.625 mm) which consist of CT volumes and annotated vessels. To evaluate the effectiveness of the method with minor vessels, we also propose an automatic stratification method to split major and minor liver vessels. Extensive experimental results demonstrate that the proposed LVSNet outperforms previous methods on liver vessel segmentation datasets. Additionally, we conduct a series of ablation studies that comprehensively support the superiority of the underlying concepts.

Details DOI

AAAI Conference 2020 Conference Paper

Pixel-Aware Deep Function-Mixture Network for Spectral Super-Resolution

Lei Zhang
Zhiqiang Lang
Peng Wang
Wei Wei
Shengcai Liao
Ling Shao
Yanning Zhang

Spectral super-resolution (SSR) aims at generating a hyperspectral image (HSI) from a given RGB image. Recently, a promising direction is to learn a complicated mapping function from the RGB image to the HSI counterpart using a deep convolutional neural network. This essentially involves mapping the RGB context within a size-specific receptive field centered at each pixel to its spectrum in the HSI. The focus thereon is to appropriately determine the receptive field size and establish the mapping function from RGB context to the corresponding spectrum. Due to their differences in category or spatial position, pixels in HSIs often require different-sized receptive fields and distinct mapping functions. However, few efforts have been invested to explicitly exploit this prior. To address this problem, we propose a pixel-aware deep function-mixture network for SSR, which is composed of a new class of modules, termed function-mixture (FM) blocks. Each FM block is equipped with some basis functions, i.e., parallel subnets of different-sized receptive fields. Besides, it incorporates an extra subnet as a mixing function to generate pixel-wise weights, and then linearly mixes the outputs of all basis functions with those generated weights. This enables us to determine the receptive field size and the mapping function pixel-wise. Moreover, we stack several such FM blocks to further increase the flexibility of the network in learning the pixel-wise mapping. To encourage feature reuse, intermediate features generated by the FM blocks are fused at a late stage, which proves to be effective for boosting the SSR performance. Experimental results on three benchmark HSI datasets demonstrate the superiority of the proposed method.
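
The pixel-wise mixing idea described in this abstract can be illustrated with a toy sketch. This is not the authors' network: the basis functions here are plain 1-D mean filters rather than learned subnets, the per-pixel mixing scores are passed in by hand rather than produced by a mixing subnet, and normalising them with a softmax is an assumption. All names (box_filter, function_mixture, radii, mixing_scores) are hypothetical.

```python
import math


def box_filter(signal, radius):
    """Mean filter of the given radius; edges use only the available neighbours."""
    n = len(signal)
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        window = signal[lo:hi]
        out.append(sum(window) / len(window))
    return out


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def function_mixture(signal, radii, mixing_scores):
    """Linearly mix the basis outputs with per-pixel weights.

    mixing_scores[i][k] is the unnormalised score of basis k at pixel i
    (a stand-in for the output of a mixing subnet; assumption).
    """
    # One basis output per receptive-field size.
    basis_outputs = [box_filter(signal, r) for r in radii]
    mixed = []
    for i in range(len(signal)):
        weights = softmax(mixing_scores[i])
        mixed.append(sum(w * basis_outputs[k][i] for k, w in enumerate(weights)))
    return mixed


signal = [0.0, 0.0, 1.0, 0.0, 0.0]
radii = [0, 1]  # two receptive-field sizes: 1 pixel and 3 pixels
# Hand-picked scores: pixel 2 prefers the small field, the rest the large one.
scores = [[-50.0, 50.0], [-50.0, 50.0], [50.0, -50.0], [-50.0, 50.0], [-50.0, 50.0]]
mixed = function_mixture(signal, radii, scores)
```

With these scores, the centre pixel effectively keeps its own value while its neighbours take the 3-pixel average, which is the sense in which each pixel selects its own receptive field.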

PDF Details

IJCAI Conference 2017 Conference Paper

Dynamic Programming Bipartite Belief Propagation For Hyper Graph Matching

Zhen Zhang
Julian McAuley
Yong Li
Wei Wei
Yanning Zhang
Qinfeng Shi

Hyper graph matching problems have drawn attention recently due to their ability to embed higher order relations between nodes. In this paper, we formulate hyper graph matching problems as constrained MAP inference problems in graphical models. Whereas previous discrete approaches introduce several global correspondence vectors, we introduce only one global correspondence vector, but several local correspondence vectors. This allows us to decompose the problem into a (linear) bipartite matching problem and several belief propagation sub-problems. Bipartite matching can be solved by traditional approaches, while the belief propagation sub-problem is further decomposed as two sub-problems with optimal substructure. Then a newly proposed dynamic programming procedure is used to solve the belief propagation sub-problem. Experiments show that the proposed methods outperform state-of-the-art techniques for hyper graph matching.

PDF Details

AAAI Conference 2017 Conference Paper

MPGL: An Efficient Matching Pursuit Method for Generalized LASSO

Dong Gong
Mingkui Tan
Yanning Zhang
Anton van den Hengel
Qinfeng Shi

Unlike traditional LASSO enforcing sparsity on the variables, Generalized LASSO (GL) enforces sparsity on a linear transformation of the variables, gaining flexibility and success in many applications. However, many existing GL algorithms do not scale up to high-dimensional problems, and/or only work well for a specific choice of the transformation. We propose an efficient Matching Pursuit Generalized LASSO (MPGL) method, which overcomes these issues, and is guaranteed to converge to a global optimum. We formulate the GL problem as a convex quadratic constrained linear programming (QCLP) problem and tailor-make a cutting plane method. More specifically, our MPGL iteratively activates a subset of nonzero elements of the transformed variables, and solves a subproblem involving only the activated elements, thus gaining significant speed-up. Moreover, MPGL is less sensitive to the choice of the trade-off hyper-parameter between data fitting and regularization, and mitigates the longstanding hyper-parameter tuning issue in many existing methods. Experiments demonstrate the superior efficiency and accuracy of the proposed method over the state of the art in both classification and image processing tasks.
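
For readers unfamiliar with the distinction drawn in the first sentence, the two objectives are commonly written as follows. The symbols (response y, design matrix X, variables \beta, transformation D, trade-off parameter \lambda) are notation chosen here for illustration and may differ from the paper's own.

```latex
% Standard LASSO: the l1 penalty acts on the variables themselves.
\min_{\beta} \; \tfrac{1}{2}\,\lVert y - X\beta \rVert_2^2 + \lambda\,\lVert \beta \rVert_1

% Generalized LASSO: the l1 penalty acts on a linear transformation D\beta.
\min_{\beta} \; \tfrac{1}{2}\,\lVert y - X\beta \rVert_2^2 + \lambda\,\lVert D\beta \rVert_1
```

Setting D to the identity recovers the standard LASSO, while choosing D as a first-difference operator gives a fused-LASSO-style penalty on neighbouring variables.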

PDF Details

AAAI Conference 2017 Conference Paper

Solving Constrained Combinatorial Optimisation Problems via MAP Inference without High-Order Penalties

Zhen Zhang
Qinfeng Shi
Julian McAuley
Wei Wei
Yanning Zhang
Rui Yao
Anton van den Hengel

Solving constrained combinatorial optimization problems via MAP inference is often achieved by introducing extra potential functions for each constraint. This can result in very high order potentials, e.g., a 2nd-order objective with pairwise potentials and a quadratic constraint over all N variables would correspond to an unconstrained objective with an order-N potential. This limits the practicality of such an approach, since inference with high order potentials is tractable only for a few special classes of functions. We propose an approach which is able to solve constrained combinatorial problems using belief propagation without increasing the order. For example, in our scheme the 2nd-order problem above remains order 2 instead of order N. Experiments on applications including foreground detection, image reconstruction, quadratic knapsack, and the M-best solutions problem demonstrate the effectiveness and efficiency of our method. Moreover, we show several situations in which our approach outperforms commercial solvers like CPLEX and others designed for specific constrained MAP inference problems.

PDF Details