Author name cluster

Guanbin Li

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

42 papers

2 author rows

AAAI Conference 2026 Conference Paper

Mobile-Agent-RAG: Driving Smart Multi-Agent Coordination with Contextual Knowledge Empowerment for Long-Horizon Mobile Automation

Yuxiang Zhou
Jichang Li
Yanhao Zhang
Haonan Lu
Guanbin Li

Mobile agents show immense potential, yet current state-of-the-art (SoTA) agents exhibit inadequate success rates on real-world, long-horizon, cross-application tasks. We attribute this bottleneck to the agents' excessive reliance on static, internal knowledge within MLLMs, which leads to two critical failure points: 1) strategic hallucinations in high-level planning and 2) operational errors during low-level execution on user interfaces (UI). The core insight of this paper is that high-level planning and low-level UI operations require fundamentally distinct types of knowledge. Planning demands high-level, strategy-oriented experiences, whereas operations necessitate low-level, precise instructions closely tied to specific app UIs. Motivated by these insights, we propose Mobile-Agent-RAG, a novel hierarchical multi-agent framework that innovatively integrates dual-level retrieval augmentation. At the planning stage, we introduce Manager-RAG to reduce strategic hallucinations by retrieving human-validated comprehensive task plans that provide high-level guidance. At the execution stage, we develop Operator-RAG to improve execution accuracy by retrieving the most precise low-level guidance for accurate atomic actions, aligned with the current app and subtask. To accurately deliver these knowledge types, we construct two specialized retrieval-oriented knowledge bases. Furthermore, we introduce Mobile-Eval-RAG, a challenging benchmark for evaluating such agents on realistic multi-app, long-horizon tasks. Extensive experiments demonstrate that Mobile-Agent-RAG significantly outperforms SoTA baselines, improving task completion rate by 11.0% and step efficiency by 10.2%, establishing a robust paradigm for context-aware, reliable multi-agent mobile automation.
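The dual-level retrieval idea in this abstract can be illustrated with a small sketch. Everything below is an assumption made for illustration: the abstract does not say how entries are embedded or ranked, so the example uses hand-written vectors and plain cosine similarity, and the names `MANAGER_KB`, `OPERATOR_KB`, and `retrieve` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def retrieve(query_vec, knowledge_base):
    """Return the text of the entry whose embedding is closest to the query.

    knowledge_base is a list of (embedding, text) pairs.
    """
    best = max(knowledge_base, key=lambda item: cosine(query_vec, item[0]))
    return best[1]

# Hypothetical knowledge bases: one holds high-level task plans,
# the other holds app-specific UI instructions.
MANAGER_KB = [
    ([1.0, 0.0], "Plan: open the notes app, copy the text, then paste it into mail."),
    ([0.0, 1.0], "Plan: search the product, compare prices, then add to cart."),
]
OPERATOR_KB = [
    ([1.0, 0.2], "In the mail app, tap the compose button at the bottom right."),
    ([0.1, 1.0], "In the shop app, tap the cart icon in the top bar."),
]
```

The planner would query `MANAGER_KB` with a task-level embedding, while the executor would query `OPERATOR_KB` with an embedding of the current app and subtask, so the two levels never share a store.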


AAAI Conference 2025 Conference Paper

Bridging Knowledge Gap Between Image Inpainting and Large-Area Visible Watermark Removal

Yicheng Leng
Chaowei Fang
Junye Chen
Yixiang Fang
Sheng Li
Guanbin Li

Visible watermark removal, which involves watermark cleaning and background content restoration, is pivotal to evaluating the resilience of watermarks. Existing deep neural network (DNN)-based models still struggle with large-area watermarks and are overly dependent on the quality of watermark mask prediction. To overcome these challenges, we introduce a novel feature adapting framework that leverages the representation modeling capacity of a pre-trained image inpainting model. Our approach bridges the knowledge gap between image inpainting and watermark removal by fusing information of the residual background content beneath watermarks into the inpainting backbone model. We establish a dual-branch system to capture and embed features from the residual background content, which are merged into intermediate features of the inpainting backbone model via gated feature fusion modules. Moreover, to relieve the dependence on high-quality watermark masks, we introduce a new training paradigm that uses coarse watermark masks to guide the inference process. This yields a visible watermark removal model that is insensitive to the quality of the watermark mask during testing. Extensive experiments on both a large-scale synthesized dataset and a real-world dataset demonstrate that our approach significantly outperforms existing state-of-the-art methods. The source code is available in the supplementary materials.
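Gated feature fusion, as mentioned in this abstract, can be sketched in a few lines. This is a generic illustration and not the paper's module: in a real network the gate logits would come from learned layers, whereas here they are passed in directly, and the function names are hypothetical.

```python
import math

def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(backbone_feat, residual_feat, gate_logits):
    """Inject residual-background features into backbone features.

    Each gate g = sigmoid(logit) lies in (0, 1) and scales how much of
    the residual feature is added to the matching backbone feature.
    """
    fused = []
    for b, r, z in zip(backbone_feat, residual_feat, gate_logits):
        g = sigmoid(z)
        fused.append(b + g * r)
    return fused
```

A strongly negative logit closes the gate, so the backbone feature passes through nearly unchanged; a strongly positive one adds almost the full residual feature.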


NeurIPS Conference 2025 Conference Paper

GUIDED: Granular Understanding via Identification, Detection, and Discrimination for Fine-Grained Open-Vocabulary Object Detection

Jiaming Li
Zhijia Liang
Weikai Chen
Lin Ma
Guanbin Li

Fine-grained open-vocabulary object detection (FG-OVD) aims to detect novel object categories described by attribute-rich texts. While existing open-vocabulary detectors show promise at the base-category level, they underperform in fine-grained settings due to the semantic entanglement of subjects and attributes in pretrained vision-language model (VLM) embeddings -- leading to over-representation of attributes, mislocalization, and semantic drift in embedding space. We propose GUIDED, a decomposition framework specifically designed to address the semantic entanglement between subjects and attributes in fine-grained prompts. By separating object localization and fine-grained recognition into distinct pathways, GUIDED aligns each subtask with the module best suited to it. Specifically, given a fine-grained class name, we first use a language model to extract a coarse-grained subject and its descriptive attributes. Then the detector is guided solely by the subject embedding, ensuring stable localization unaffected by irrelevant or overrepresented attributes. To selectively retain helpful attributes, we introduce an attribute embedding fusion module that incorporates attribute information into detection queries in an attention-based manner. This mitigates over-representation while preserving discriminative power. Finally, a region-level attribute discrimination module compares each detected region against full fine-grained class names using a refined vision-language model with a projection head for improved alignment. Extensive experiments on FG-OVD and 3F-OVD benchmarks show that GUIDED achieves new state-of-the-art results, demonstrating the benefits of disentangled modeling and modular optimization.


AAAI Conference 2025 Conference Paper

Hierarchically Controlled Deformable 3D Gaussians for Talking Head Synthesis

Zhenhua Wu
Linxuan Jiang
Xiang Li
Chaowei Fang
Yipeng Qin
Guanbin Li

Audio-driven talking head synthesis is a critical task in digital human modeling. While recent advances using diffusion models and Neural Radiance Fields (NeRF) have improved visual quality, they often require substantial computational resources, limiting practical deployment. We present a novel framework for audio-driven talking head synthesis, namely Hierarchically Controlled Deformable 3D Gaussians (HiCoDe), which achieves state-of-the-art performance with significantly reduced computational costs. Our key contribution is a hierarchical control strategy that effectively bridges the gap between sparse audio features and dense 3D Gaussian point clouds. Specifically, this strategy comprises two control levels: i) coarse-level control based on a 3D Morphable Model (3DMM) and ii) fine-level control using facial landmarks. Extensive experiments on the HDTF dataset and additional test sets demonstrate that our method outperforms existing approaches in visual quality, facial landmark accuracy, and audio-visual synchronization while being more computationally efficient in both training and inference.


JBHI Journal 2025 Journal Article

Highlighted Diffusion Model as Plug-In Priors for Polyp Segmentation

Yuhao Du
Yuncheng Jiang
Shuangyi Tan
Si-Qi Liu
Zhen Li
Guanbin Li
Xiang Wan

Automated polyp segmentation from colonoscopy images is crucial for colorectal cancer diagnosis. The accuracy of such segmentation, however, is challenged by two main factors. First, the variability in polyps' size, shape, and color, coupled with the scarcity of well-annotated data due to the need for specialized manual annotation, hampers the efficacy of existing deep learning methods. Second, concealed polyps often blend with adjacent intestinal tissues, leading to poor contrast that challenges segmentation models. Recently, diffusion models have been explored and adapted for polyp segmentation tasks. However, the significant domain gap between RGB colonoscopy images and grayscale segmentation masks, along with the low efficiency of the diffusion generation process, hinders the practical implementation of these models. To mitigate these challenges, we introduce the Highlighted Diffusion Model Plus (HDM+), a two-stage polyp segmentation framework. This framework incorporates the Highlighted Diffusion Model (HDM) to provide explicit semantic guidance, thereby enhancing segmentation accuracy. In the first stage, the HDM is trained using highlighted ground-truth data, which emphasizes polyp regions while suppressing the background in the images. This approach reduces the domain gap by focusing on the image itself rather than on the segmentation mask. In the second stage, we employ the highlighted features from the trained HDM's U-Net model as plug-in priors for polyp segmentation, rather than generating highlighted images, thereby increasing efficiency. Extensive experiments conducted on six polyp segmentation benchmarks demonstrate the effectiveness of our approach.
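One plausible reading of "highlighted ground-truth data" is an image in which polyp pixels are kept and background pixels are dimmed. The sketch below shows that reading only; the abstract does not give the actual construction, so the `background_scale` parameter and the function name `highlight` are assumptions.

```python
def highlight(image, mask, background_scale=0.5):
    """Build a highlighted target from an image and a binary polyp mask.

    image: 2D list of pixel intensities.
    mask:  2D list of 0/1 values, 1 marking polyp pixels.
    Polyp pixels keep their intensity; background pixels are scaled down.
    """
    return [
        [px if m == 1 else px * background_scale for px, m in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]
```

Because the target stays in image space rather than mask space, a generative model trained on it never has to map RGB input to a grayscale mask, which is the domain gap the abstract refers to.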


IJCAI Conference 2025 Conference Paper

Pseudo-Label Reconstruction for Partial Multi-Label Learning

Yu Chen
Fang Li
Na Han
Guanbin Li
Hongbo Gao
Sixian Chan
Xiaozhao Fang

In Partial Multi-Label Learning (PML), each instance is associated with a candidate label set containing multiple relevant labels along with other false positive labels. Currently, most PML methods directly extract instance correlation from instance features while ignoring the candidate labels, which may contain more discriminative instance-related information. This paper argues that, with a well-designed model, more accurate instance correlation can be mined from the candidate labels to facilitate label disambiguation. To this end, we propose a novel PML method based on pseudo-label reconstruction (PML-PLR). Specifically, we first propose a novel orthogonal candidate label reconstruction method, which is jointly optimized with instance features to extract more consistent instance correlation. Then, we use the instance correlation as reconstruction coefficients to reconstruct pseudo-labels. Subsequently, through local manifold learning, the reconstructed pseudo-labels are leveraged to propagate the consistency relationship between labels and instances, thereby improving the accuracy of pseudo-labels. Extensive experiments and analyses demonstrate that the proposed PML-PLR outperforms state-of-the-art methods.
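The step "use instance correlation as reconstruction coefficients" can be read as a weighted combination of candidate-label rows. The sketch below shows only that combination under this assumption; how the correlation matrix is learned (the orthogonal reconstruction and manifold terms) is not shown, and the function name is hypothetical.

```python
def reconstruct_pseudo_labels(correlation, candidate_labels):
    """Compute pseudo-labels as correlation-weighted sums of candidate labels.

    correlation:      n x n list of lists; row i holds the weights that
                      instance i places on every instance.
    candidate_labels: n x q list of lists with 0/1 candidate indicators.
    Returns an n x q list of lists of real-valued pseudo-label scores.
    """
    n_labels = len(candidate_labels[0])
    return [
        [
            sum(w * candidate_labels[j][k] for j, w in enumerate(row))
            for k in range(n_labels)
        ]
        for row in correlation
    ]
```

An instance whose neighbors all lack a given candidate label ends up with a low score for that label, which is how correlation can help flag false positives.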


ICML Conference 2025 Conference Paper

ReferSplat: Referring Segmentation in 3D Gaussian Splatting

Shuting He
Guangquan Jie
Changshuo Wang 0001
Yun Zhou
Shuming Hu
Guanbin Li
Henghui Ding

We introduce Referring 3D Gaussian Splatting Segmentation (R3DGS), a new task that aims to segment target objects in a 3D Gaussian scene based on natural language descriptions, which often contain spatial relationships or object attributes. This task requires the model to identify newly described objects that may be occluded or not directly visible in a novel view, posing a significant challenge for 3D multi-modal understanding. Developing this capability is crucial for advancing embodied AI. To support research in this area, we construct the first R3DGS dataset, Ref-LERF. Our analysis reveals that 3D multi-modal understanding and spatial relationship modeling are key challenges for R3DGS. To address these challenges, we propose ReferSplat, a framework that explicitly models 3D Gaussian points with natural language expressions in a spatially aware paradigm. ReferSplat achieves state-of-the-art performance on both the newly proposed R3DGS task and 3D open-vocabulary segmentation benchmarks. Dataset and code are available at https://github.com/heshuting555/ReferSplat.


IJCAI Conference 2025 Conference Paper

Screening, Rectifying, and Re-Screening: A Unified Framework for Tuning Vision-Language Models with Noisy Labels

Chaowei Fang
Hangfei Ma
Zhihao Li
De Cheng
Yue Zhang
Guanbin Li

Pre-trained vision-language models have shown remarkable potential for downstream tasks. However, their fine-tuning under noisy labels remains an open problem due to challenges like self-confirmation bias and the limitations of conventional small-loss criteria. In this paper, we propose a unified framework to address these issues, consisting of three key steps: Screening, Rectifying, and Re-Screening. First, a dual-level semantic matching mechanism is introduced to categorize samples into clean, ambiguous, and noisy samples by leveraging both macro-level and micro-level textual prompts. Second, we design tailored pseudo-labeling strategies to rectify noisy and ambiguous labels, enabling their effective incorporation into the training process. Finally, a re-screening step, utilizing cross-validation with an auxiliary vision-language model, mitigates self-confirmation bias and enhances the robustness of the framework. Extensive experiments across ten datasets demonstrate that the proposed method significantly outperforms existing approaches for tuning vision-language pre-trained models with noisy labels.


AAAI Conference 2024 Conference Paper

Cell Graph Transformer for Nuclei Classification

Wei Lou
Guanbin Li
Xiang Wan
Haofeng Li

Nuclei classification is a critical step in computer-aided diagnosis with histopathology images. In the past, various methods have employed graph neural networks (GNN) to analyze cell graphs that model inter-cell relationships by considering nuclei as vertices. However, they are limited by the GNN mechanism that only passes messages among local nodes via fixed edges. To address the issue, we develop a cell graph transformer (CGT) that treats nodes and edges as input tokens to enable learnable adjacency and information exchange among all nodes. Nevertheless, training the transformer with a cell graph presents another challenge. Poorly initialized features can lead to noisy self-attention scores and inferior convergence, particularly when processing the cell graphs with numerous connections. Thus, we further propose a novel topology-aware pretraining method that leverages a graph convolutional network (GCN) to learn a feature extractor. The pre-trained features may suppress unreasonable correlations and hence ease the finetuning of CGT. Experimental results suggest that the proposed cell graph transformer with topology-aware pretraining significantly improves the nuclei classification results, and achieves the state-of-the-art performance. Code and models are available at https://github.com/lhaof/CGT


JBHI Journal 2024 Journal Article

ECC-PolypDet: Enhanced CenterNet With Contrastive Learning for Automatic Polyp Detection

Yuncheng Jiang
Zixun Zhang
Yiwen Hu
Guanbin Li
Xiang Wan
Song Wu
Shuguang Cui
Silin Huang

Accurate polyp detection is critical for early colorectal cancer diagnosis. Although remarkable progress has been achieved in recent years, the complex colon environment and concealed polyps with unclear boundaries still pose severe challenges in this area. Existing methods either involve computationally expensive context aggregation or lack prior modeling of polyps, resulting in poor performance in challenging cases. In this paper, we propose the Enhanced CenterNet with Contrastive Learning (ECC-PolypDet), a two-stage training & end-to-end inference framework that leverages images and bounding box annotations to train a general model and fine-tune it based on the inference score to obtain a final robust model. Specifically, we conduct Box-assisted Contrastive Learning (BCL) during training to minimize the intra-class difference and maximize the inter-class difference between foreground polyps and backgrounds, enabling our model to capture concealed polyps. Moreover, to enhance the recognition of small polyps, we design the Semantic Flow-guided Feature Pyramid Network (SFFPN) to aggregate multi-scale features and the Heatmap Propagation (HP) module to boost the model's attention on polyp targets. In the fine-tuning stage, we introduce the IoU-guided Sample Re-weighting (ISR) mechanism to prioritize hard samples by adaptively adjusting the loss weight for each sample during fine-tuning. Extensive experiments on six large-scale colonoscopy datasets demonstrate the superiority of our model compared with previous state-of-the-art detectors.
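The idea of prioritizing hard samples by IoU can be sketched as a simple loss re-weighting. The weighting rule below, `1 + gamma * (1 - IoU)`, is an assumption chosen for illustration and is not taken from the paper; `isr_weights`, `weighted_loss`, and `gamma` are hypothetical names.

```python
def isr_weights(ious, gamma=1.0):
    """Give larger weights to samples whose predicted box overlaps poorly.

    ious: per-sample IoU between prediction and ground truth, in [0, 1].
    A perfect prediction (IoU = 1) gets weight 1; the weight grows
    linearly as IoU drops, up to 1 + gamma at IoU = 0.
    """
    return [1.0 + gamma * (1.0 - iou) for iou in ious]

def weighted_loss(losses, ious, gamma=1.0):
    """Weighted mean of per-sample losses using the IoU-based weights."""
    weights = isr_weights(ious, gamma)
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)
```

With `gamma=0` this reduces to an ordinary mean, which makes the effect of the re-weighting easy to ablate.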


AAAI Conference 2024 Conference Paper

FedDiv: Collaborative Noise Filtering for Federated Learning with Noisy Labels

Jichang Li
Guanbin Li
Hui Cheng
Zicheng Liao
Yizhou Yu

Federated Learning with Noisy Labels (F-LNL) aims at seeking an optimal server model via collaborative distributed learning by aggregating multiple client models trained with local noisy or clean samples. On the basis of a federated learning framework, recent advances primarily adopt label noise filtering to separate clean samples from noisy ones on each client, thereby mitigating the negative impact of label noise. However, these prior methods do not learn noise filters by exploiting knowledge across all clients, leading to sub-optimal and inferior noise filtering performance and thus damaging training stability. In this paper, we present FedDiv to tackle the challenges of F-LNL. Specifically, we propose a global noise filter called Federated Noise Filter for effectively identifying samples with noisy labels on every client, thereby raising stability during local training sessions. Without sacrificing data privacy, this is achieved by modeling the global distribution of label noise across all clients. Then, in an effort to make the global model achieve higher performance, we introduce a Predictive Consistency based Sampler to identify more credible local data for local model training, thus preventing noise memorization and further boosting the training stability. Extensive experiments on CIFAR-10, CIFAR-100, and Clothing1M demonstrate that FedDiv achieves superior performance over state-of-the-art F-LNL methods under different label noise settings for both IID and non-IID data partitions. Source code is publicly available at https://github.com/lijichang/FLNL-FedDiv.


AAAI Conference 2024 Conference Paper

Removing Interference and Recovering Content Imaginatively for Visible Watermark Removal

Yicheng Leng
Chaowei Fang
Gen Li
Yixiang Fang
Guanbin Li

Visible watermarks, while instrumental in protecting image copyrights, frequently distort the underlying content, complicating tasks like scene interpretation and image editing. Visible watermark removal aims to eliminate the interference of watermarks and restore the background content. However, existing methods often implement watermark component removal and background restoration tasks within a singular branch, leading to residual watermarks in the predictions and ignoring cases where watermarks heavily obscure the background. To address these limitations, this study introduces the Removing Interference and Recovering Content Imaginatively (RIRCI) framework. RIRCI embodies a two-stage approach: the initial phase centers on discerning and segregating the watermark component, while the subsequent phase focuses on background content restoration. To achieve meticulous background restoration, our proposed model employs a dual-path network capable of fully exploring the intrinsic background information beneath semi-transparent watermarks and peripheral contextual information from unaffected regions. Moreover, a Global and Local Context Interaction module is built upon multi-layer perceptrons and bidirectional feature transformation for comprehensive representation modeling in the background restoration phase. The efficacy of our approach is empirically validated across two large-scale datasets, and our findings reveal a marked enhancement over existing watermark removal techniques.


AAAI Conference 2024 Conference Paper

UniCell: Universal Cell Nucleus Classification via Prompt Learning

Junjia Huang
Haofeng Li
Xiang Wan
Guanbin Li

The recognition of multi-class cell nuclei can significantly facilitate the process of histopathological diagnosis. Numerous pathological datasets are currently available, but their annotations are inconsistent. Most existing methods require individual training on each dataset to deduce the relevant labels and lack the use of common knowledge across datasets, consequently restricting the quality of recognition. In this paper, we propose a universal cell nucleus classification framework (UniCell), which employs a novel prompt learning mechanism to uniformly predict the corresponding categories of pathological images from different dataset domains. In particular, our framework adopts an end-to-end architecture for nuclei detection and classification, and utilizes flexible prediction heads for adapting various datasets. Moreover, we develop a Dynamic Prompt Module (DPM) that exploits the properties of multiple datasets to enhance features. The DPM first integrates the embeddings of datasets and semantic categories, and then employs the integrated prompts to refine image representations, efficiently harvesting the shared knowledge among the related cell types and data sources. Experimental results demonstrate that the proposed method effectively achieves the state-of-the-art results on four nucleus detection and classification benchmarks. Code and models are available at https://github.com/lhaof/UniCell


NeurIPS Conference 2024 Conference Paper

UniFL: Improve Latent Diffusion Model via Unified Feedback Learning

Jiacheng Zhang
Jie Wu
Yuxi Ren
Xin Xia
Huafeng Kuang
Pan Xie
Jiashi Li
Xuefeng Xiao

Latent diffusion models (LDM) have revolutionized text-to-image generation, leading to the proliferation of various advanced models and diverse downstream applications. However, despite these significant advancements, current diffusion models still suffer from several limitations, including inferior visual quality, inadequate aesthetic appeal, and inefficient inference, without a comprehensive solution in sight. To address these challenges, we present UniFL, a unified framework that leverages feedback learning to enhance diffusion models comprehensively. UniFL stands out as a universal, effective, and generalizable solution applicable to various diffusion models, such as SD1.5 and SDXL. Notably, UniFL consists of three key components: perceptual feedback learning, which enhances visual quality; decoupled feedback learning, which improves aesthetic appeal; and adversarial feedback learning, which accelerates inference. In-depth experiments and extensive user studies validate the superior performance of our method in enhancing generation quality and inference acceleration. For instance, UniFL surpasses ImageReward by 17% user preference in terms of generation quality and outperforms LCM and SDXL Turbo by 57% and 20% general preference with 4-step inference.


AAAI Conference 2024 Conference Paper

Variance-Insensitive and Target-Preserving Mask Refinement for Interactive Image Segmentation

Chaowei Fang
Ziyin Zhou
Junye Chen
Hanjing Su
Qingyao Wu
Guanbin Li

Point-based interactive image segmentation can ease the burden of mask annotation in applications such as semantic segmentation and image editing. However, fully extracting the target mask with limited user inputs remains challenging. We introduce a novel method, Variance-Insensitive and Target-Preserving Mask Refinement, to enhance segmentation quality with fewer user inputs. Regarding the last segmentation result as the initial mask, an iterative refinement process is commonly employed to continually enhance the initial mask. Nevertheless, conventional techniques suffer from sensitivity to the variance in the initial mask. To circumvent this problem, our proposed method incorporates a mask matching algorithm for ensuring consistent inferences from different types of initial masks. We also introduce a target-aware zooming algorithm to preserve object information during downsampling, balancing efficiency and accuracy. Experiments on GrabCut, Berkeley, SBD, and DAVIS datasets demonstrate our method's state-of-the-art performance in interactive image segmentation.


ICLR Conference 2024 Conference Paper

VersVideo: Leveraging Enhanced Temporal Diffusion Models for Versatile Video Generation

Jinxi Xiang
Ricong Huang
Jun Zhang 0018
Guanbin Li
Xiao Han 0011
Yang Wei

Creating stable, controllable videos is a complex task due to the need for significant variation in temporal dynamics and cross-frame temporal consistency. To address this, we enhance the spatial-temporal capability and introduce a versatile video generation model, VersVideo, which leverages textual, visual, and stylistic conditions. Current video diffusion models typically extend image diffusion architectures by supplementing 2D operations (such as convolutions and attentions) with temporal operations. While this approach is efficient, it often restricts spatial-temporal performance due to the oversimplification of standard 3D operations. To counter this, we incorporate two key elements: (1) multi-excitation paths for spatial-temporal convolutions with dimension pooling across different axes, and (2) multi-expert spatial-temporal attention blocks. These enhancements boost the model's spatial-temporal performance without significantly escalating training and inference costs. We also tackle the issue of information loss that arises when a variational autoencoder is used to transform pixel space into latent features and then back into pixel frames. To mitigate this, we incorporate temporal modules into the decoder to maintain inter-frame consistency. Lastly, by utilizing the innovative denoising UNet and decoder, we develop a unified ControlNet model suitable for various conditions, including image, Canny, HED, depth, and style. Examples of the videos generated by our model can be found at https://jinxixiang.github.io/versvideo/.


NeurIPS Conference 2024 Conference Paper

WhodunitBench: Evaluating Large Multimodal Agents via Murder Mystery Games

Junlin Xie
Ruifei Zhang
Zhihong Chen
Xiang Wan
Guanbin Li

Recently, large language models (LLMs) have achieved superior performance, empowering the development of large multimodal agents (LMAs). An LMA that is expected to execute practical tasks requires various capabilities, including multimodal perception, interaction, reasoning, and decision making. However, existing benchmarks are limited in assessing the compositional skills and actions demanded by practical scenarios, as they primarily focus on single tasks and static scenarios. To bridge this gap, we introduce WhodunitBench, a benchmark rooted in murder mystery games, where players are required to utilize the aforementioned skills to achieve their objective (i.e., identifying the 'murderer' or hiding themselves), providing a simulated dynamic environment for evaluating LMAs. Specifically, WhodunitBench includes two evaluation modes. The first mode, the arena-style evaluation, is constructed from 50 meticulously curated scripts featuring clear reasoning clues and distinct murderers; the second mode, the chain of evaluation, consists of over 3000 curated multiple-choice questions and open-ended questions, aiming to assess every facet of the murder mystery games for LMAs. Experiments show that although current LMAs show acceptable performance in basic perceptual tasks, they are insufficiently equipped for complex multi-agent collaboration and multi-step reasoning tasks. Furthermore, the full application of the theory of mind to complete games in a manner akin to human behavior remains a significant challenge. We hope this work can illuminate the path forward, providing a solid foundation for the future development of LMAs. Our WhodunitBench is open-source and accessible at: https://github.com/jun0wanan/WhodunitBench-Murder_Mystery_Games


AAAI Conference 2023 Conference Paper

Adapting Object Size Variance and Class Imbalance for Semi-supervised Object Detection

Yuxiang Nie
Chaowei Fang
Lechao Cheng
Liang Lin
Guanbin Li

Semi-supervised object detection (SSOD) attracts extensive research interest due to its great significance in reducing the data annotation effort. Collecting high-quality and category-balanced pseudo labels for unlabeled images is critical to addressing the SSOD problem. However, most of the existing pseudo-labeling-based methods depend on a large and fixed threshold to select high-quality pseudo labels from the predictions of a teacher model. Because different object classes usually have different detection difficulty levels due to scale variance and data distribution imbalance, conventional pseudo-labeling-based methods struggle to fully exploit the value of unlabeled data. To address these issues, we propose an adaptive pseudo labeling strategy, which can assign thresholds to classes with respect to their “hardness”. This is beneficial for ensuring the high quality of easier classes and increasing the quantity of harder classes simultaneously. Besides, label refinement modules are set up based on box jittering for guaranteeing the localization quality of pseudo labels. To further improve the algorithm’s robustness against scale variance and make the most of pseudo labels, we devise a joint feature-level and prediction-level consistency learning pipeline for transferring the information of the teacher model to the student model. Extensive experiments on COCO and VOC datasets indicate that our method achieves state-of-the-art performance. In particular, it brings mean average precision gains of 2.08 and 1.28 on the MS-COCO dataset with 5% and 10% labeled images, respectively.
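Assigning thresholds per class according to "hardness" can be sketched as follows. The abstract does not define hardness, so this example assumes it is reflected in the teacher's mean confidence per class and scales a base threshold accordingly; `class_thresholds`, `base`, and `floor` are hypothetical names and values.

```python
def class_thresholds(mean_conf_per_class, base=0.9, floor=0.5):
    """Derive one pseudo-label threshold per class.

    mean_conf_per_class: dict mapping class name to the teacher's mean
    confidence on that class (a stand-in for how easy the class is).
    The easiest class gets roughly `base`; harder classes get
    proportionally lower thresholds, but never below `floor`.
    """
    top = max(mean_conf_per_class.values())
    return {
        cls: max(floor, base * conf / top)
        for cls, conf in mean_conf_per_class.items()
    }

def select_pseudo_labels(predictions, thresholds):
    """Keep (class, score) predictions that reach their class threshold."""
    return [(cls, score) for cls, score in predictions if score >= thresholds[cls]]
```

Compared with a single fixed threshold, this keeps easy classes strict while letting more predictions through for hard classes, which is the quality versus quantity trade-off the abstract describes.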


AAAI Conference 2023 Conference Paper

De-biased Teacher: Rethinking IoU Matching for Semi-supervised Object Detection

Kuo Wang
Jingyu Zhuang
Guanbin Li
Chaowei Fang
Lechao Cheng
Liang Lin
Fan Zhou

Most of the recent research in semi-supervised object detection follows the pseudo-labeling paradigm evolved from the semi-supervised image classification task. However, the training paradigm of the two-stage object detector inevitably makes the pseudo-label learning process for unlabeled images full of bias. Specifically, the IoU matching scheme used for selecting and labeling candidate boxes is based on the assumption that the matching source (ground truth) is accurate enough in terms of the number of objects, object position and object category. Obviously, pseudo-labels generated for unlabeled images cannot satisfy such a strong assumption, which makes the produced training proposals extremely unreliable and thus severely spoils the follow-up training. To de-bias the training proposals generated by the pseudo-label-based IoU matching, we propose a general framework -- De-biased Teacher, which abandons both the IoU matching and pseudo labeling processes by directly generating favorable training proposals for consistency regularization between the weak/strong augmented image pairs. Moreover, a distribution-based refinement scheme is designed to eliminate the scattered class predictions of significantly low values for higher efficiency. Extensive experiments demonstrate that the proposed De-biased Teacher consistently outperforms other state-of-the-art methods on the MS-COCO and PASCAL VOC benchmarks. Source codes are available at https://github.com/wkfdb/De-biased-Teracher.
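For context, the conventional IoU matching that this abstract criticizes can be sketched as below. This shows the standard scheme, not the proposed De-biased Teacher; the 0.5 threshold and the function names are illustrative assumptions.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def assign_labels(proposals, gt_boxes, gt_labels, pos_thresh=0.5):
    """Label each proposal with the class of its best-overlapping box.

    Proposals whose best IoU falls below pos_thresh become "background".
    """
    labels = []
    for p in proposals:
        best_iou, best_label = 0.0, "background"
        for g, lab in zip(gt_boxes, gt_labels):
            iou = box_iou(p, g)
            if iou > best_iou:
                best_iou, best_label = iou, lab
        labels.append(best_label if best_iou >= pos_thresh else "background")
    return labels
```

When `gt_boxes` and `gt_labels` are pseudo-labels rather than true annotations, any missing, misplaced, or misclassified box propagates directly into the proposal labels, which is the bias the abstract points to.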


IJCAI Conference 2023 Conference Paper

DenseLight: Efficient Control for Large-scale Traffic Signals with Dense Feedback

Junfan Lin
Yuying Zhu
Lingbo Liu
Yang Liu
Guanbin Li
Liang Lin

Traffic Signal Control (TSC) aims to reduce the average travel time of vehicles in a road network, which in turn enhances fuel utilization efficiency, air quality, and road safety, benefiting society as a whole. Due to the complexity of long-horizon control and coordination, most prior TSC methods leverage deep reinforcement learning (RL) to search for a control policy and have witnessed great success. However, TSC still faces two significant challenges. 1) The travel time of a vehicle is delayed feedback on the effectiveness of TSC policy at each traffic intersection since it is obtained after the vehicle has left the road network. Although several heuristic reward functions have been proposed as substitutes for travel time, they are usually biased and do not lead the policy to improve in the correct direction. 2) The traffic condition of each intersection is influenced by the non-local intersections since vehicles traverse multiple intersections over time. Therefore, the TSC agent is required to leverage both the local observation and the non-local traffic conditions to predict the long-horizon traffic conditions of each intersection comprehensively. To address these challenges, we propose DenseLight, a novel RL-based TSC method that employs an unbiased reward function to provide dense feedback on policy effectiveness and a non-local enhanced TSC agent to better predict future traffic conditions for more precise traffic control. Extensive experiments and ablation studies demonstrate that DenseLight can consistently outperform advanced baselines on various road networks with diverse traffic flows. The code is available at https://github.com/junfanlin/DenseLight.

PDF Details DOI

IJCAI Conference 2023 Conference Paper

Long-term Wind Power Forecasting with Hierarchical Spatial-Temporal Transformer

Yang Zhang
Lingbo Liu
Xinyu Xiong
Guanbin Li
Guoli Wang
Liang Lin

Wind power is attracting increasing attention around the world due to its renewable, pollution-free, and other advantages. However, safely and stably integrating the high permeability intermittent power energy into electric power systems remains challenging. Accurate wind power forecasting (WPF) can effectively reduce power fluctuations in power system operations. Existing methods are mainly designed for short-term predictions and lack effective spatial-temporal feature augmentation. In this work, we propose a novel end-to-end wind power forecasting model named Hierarchical Spatial-Temporal Transformer Network (HSTTN) to address the long-term WPF problems. Specifically, we construct an hourglass-shaped encoder-decoder framework with skip-connections to jointly model representations aggregated in hierarchical temporal scales, which benefits long-term forecasting. Based on this framework, we capture the inter-scale long-range temporal dependencies and global spatial correlations with two parallel Transformer skeletons and strengthen the intra-scale connections with downsampling and upsampling operations. Moreover, the complementary information from spatial and temporal features is fused and propagated in each other via Contextual Fusion Blocks (CFBs) to promote the prediction further. Extensive experimental results on two large-scale real-world datasets demonstrate the superior performance of our HSTTN over existing solutions.

PDF Details DOI

JBHI Journal 2023 Journal Article

Multi-Task Learning With Hierarchical Guidance for Locating and Stratifying Submucosal Tumors

Ruifei Zhang
Feng Zhang
Si Qin
Dejun Fan
Chaowei Fang
Jie Ma
Xiang Wan
Guanbin Li

Locating and stratifying the submucosal tumor of the digestive tract from endoscopy ultrasound (EUS) images are of vital significance to the preliminary diagnosis of tumors. However, the above problems are challenging, due to the poor appearance contrast between different layers of the digestive tract wall (DTW) and the narrowness of each layer. Few existing deep-learning based diagnosis algorithms are devised to tackle this issue. In this article, we build a multi-task framework for simultaneously locating and stratifying the submucosal tumor. Considering that awareness of the DTW is critical to the localization and stratification of the tumor, we integrate the DTW segmentation task into the proposed multi-task framework. Besides sharing a common backbone model, the three tasks are explicitly directed with a hierarchical guidance module, in which the probability map of DTW itself is used to locally enhance the feature representation for tumor localization, and the probability maps of DTW and tumor are jointly employed to locally enhance the feature representation for tumor stratification. Moreover, by means of the dynamic class activation map, probability maps of DTW and tumor are reused to enforce the stratification inference process to pay more attention to DTW and tumor regions, contributing to a reliable and interpretable submucosal tumor stratification model. Additionally, considering that the relation with respect to other structures is beneficial for stratifying tumors, we devise a graph reasoning module to replenish non-local relation knowledge for the stratification branch. Experiments on a Stomach-Esophagus and an Intestinal EUS dataset prove that our method achieves very appealing performance on both tumor localization and stratification, significantly outperforming state-of-the-art object detection approaches.

Details DOI

AAAI Conference 2022 Conference Paper

A Causal Debiasing Framework for Unsupervised Salient Object Detection

Xiangru Lin
Ziyi Wu
Guanqi Chen
Guanbin Li
Yizhou Yu

Unsupervised Salient Object Detection (USOD) is a promising yet challenging task that aims to learn a salient object detection model without any ground-truth labels. Self-supervised learning based methods have achieved remarkable success recently and have become the dominant approach in USOD. However, we observed that two distribution biases of salient objects limit further performance improvement of the USOD methods, namely, contrast distribution bias and spatial distribution bias. Concretely, contrast distribution bias is essentially a confounder that makes images with similar high-level semantic contrast and/or low-level visual appearance contrast spuriously dependent, thus forming data-rich contrast clusters and leading the training process biased towards the data-rich contrast clusters in the data. Spatial distribution bias means that the position distribution of all salient objects in a dataset is concentrated on the center of the image plane, which could be harmful to off-center objects prediction. This paper proposes a causal based debiasing framework to disentangle the model from the impact of such biases. Specifically, we use causal intervention to perform deconfounded model training to minimize the contrast distribution bias and propose an image-level weighting strategy that softly weights each image’s importance according to the spatial distribution bias map. Extensive experiments on 6 benchmark datasets show that our method significantly outperforms previous unsupervised state-of-the-art methods and even surpasses some of the supervised methods, demonstrating our debiasing framework’s effectiveness.

PDF Details

AAAI Conference 2022 Conference Paper

A Causal Inference Look at Unsupervised Video Anomaly Detection

Xiangru Lin
Yuyang Chen
Guanbin Li
Yizhou Yu

Unsupervised video anomaly detection, a task that requires no labeled normal/abnormal training data in any form, is challenging yet of great importance to both industrial applications and academic research. Existing methods typically follow an iterative pseudo label generation process. However, they lack a principled analysis of the impact of such pseudo label generation on training. Furthermore, the long-range temporal dependencies have also been overlooked, which is unreasonable since the definition of an abnormal event depends on the long-range temporal context. To this end, first, we propose a causal graph to analyze the confounding effect of the pseudo label generation process. Then, we introduce a simple yet effective causal inference based framework to disentangle the noisy pseudo label’s impact. Finally, we perform counterfactual based model ensemble that blends long-range temporal context with local image context in inference to make final anomaly detection. Extensive experiments on six standard benchmark datasets show that our proposed method significantly outperforms previous state-of-the-art methods, demonstrating our framework’s effectiveness.

PDF Details

NeurIPS Conference 2022 Conference Paper

Divide and Contrast: Source-free Domain Adaptation via Adaptive Contrastive Learning

Ziyi Zhang
Weikai Chen
Hui Cheng
Zhen Li
Siyuan Li
Liang Lin
Guanbin Li

We investigate a practical domain adaptation task, called source-free domain adaptation (SFUDA), where the source pretrained model is adapted to the target domain without access to the source data. Existing techniques mainly leverage self-supervised pseudo-labeling to achieve class-wise global alignment [1] or rely on local structure extraction that encourages the feature consistency among neighborhoods [2]. While impressive progress has been made, both lines of methods have their own drawbacks – the “global” approach is sensitive to noisy labels while the “local” counterpart suffers from the source bias. In this paper, we present Divide and Contrast (DaC), a new paradigm for SFUDA that strives to connect the good ends of both worlds while bypassing their limitations. Based on the prediction confidence of the source model, DaC divides the target data into source-like and target-specific samples, where either group of samples is treated with tailored goals under an adaptive contrastive learning framework. Specifically, the source-like samples are utilized for learning global class clustering thanks to their relatively clean labels. The more noisy target-specific data are harnessed at the instance level for learning the intrinsic local structures. We further align the source-like domain with the target-specific samples using a memory bank-based Maximum Mean Discrepancy (MMD) loss to reduce the distribution mismatch. Extensive experiments on VisDA, Office-Home, and the more challenging DomainNet have verified the superior performance of DaC over current state-of-the-art approaches. The code is available at https://github.com/ZyeZhang/DaC.git.
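
The division step described in this abstract, splitting target data by the source model's prediction confidence, might look roughly like the sketch below. It assumes the simplest possible rule, a fixed threshold on the maximum class probability. The function name and the 0.9 threshold are hypothetical; the paper's actual criterion may be adaptive and is not specified here.

```python
from typing import List, Sequence, Tuple


def divide_by_confidence(
    probs: Sequence[Sequence[float]],
    threshold: float = 0.9,  # hypothetical value chosen for illustration
) -> Tuple[List[int], List[int]]:
    """Split sample indices into (source_like, target_specific).

    probs[i] is assumed to be the source model's class-probability vector
    for target sample i. Samples whose top probability reaches the threshold
    are treated as source-like; the rest as target-specific.
    """
    source_like: List[int] = []
    target_specific: List[int] = []
    for idx, p in enumerate(probs):
        if max(p) >= threshold:
            source_like.append(idx)
        else:
            target_specific.append(idx)
    return source_like, target_specific
```

Under this reading, the source-like indices would feed the class-level clustering objective and the target-specific indices the instance-level one.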

PDF Details

IJCAI Conference 2022 Conference Paper

Double-Check Soft Teacher for Semi-Supervised Object Detection

Kuo Wang
Yuxiang Nie
Chaowei Fang
Chengzhi Han
Xuewen Wu
Xiaohui Wang
Liang Lin
Fan Zhou

In the semi-supervised object detection task, due to the scarcity of labeled data and the diversity and complexity of objects to be detected, the quality of pseudo-labels generated by existing methods for unlabeled data is relatively low, which severely restricts the performance of semi-supervised object detection. In this paper, we revisit the pseudo-labeling based Teacher-Student mutual learning framework for semi-supervised object detection and identify that the inconsistency of the location and feature of the candidate object proposals between the Teacher and the Student branches is the fatal cause of the low quality of the pseudo labels. To address this issue, we propose a simple yet effective technique within the mainstream teacher-student framework, called Double Check Soft Teacher, to overcome the harm caused by insufficient quality of pseudo labels. Specifically, our proposed method leverages the teacher model to generate pseudo labels for the student model. In particular, the candidate boxes generated by the student model based on the pseudo label will be sent to the teacher model for "double check", and then the teacher model will output probabilistic soft labels with a background class for those candidate boxes, which will be used to train the student model. Together with a pseudo labeling mechanism based on the sum of the TOP-K prediction score, which improves the recall rate of pseudo labels, Double Check Soft Teacher consistently surpasses state-of-the-art methods by significant margins on the MS-COCO benchmark, pushing the new state-of-the-art. Source codes are available at https://github.com/wkfdb/DCST.
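
A pseudo-labeling rule based on the sum of the top-K prediction scores, as mentioned in this abstract, could be sketched as below. This is only an illustration of the idea: the function names, `k`, and the threshold are assumptions, and the paper's exact scoring and filtering details are not given here.

```python
from typing import List, Sequence


def topk_score_sum(scores: Sequence[float], k: int = 3) -> float:
    """Sum of the k largest class scores for one candidate box."""
    return sum(sorted(scores, reverse=True)[:k])


def select_pseudo_boxes(
    box_scores: Sequence[Sequence[float]],
    k: int = 3,              # hypothetical choice of K
    threshold: float = 0.8,  # hypothetical confidence threshold
) -> List[int]:
    """Return indices of boxes whose top-k score sum reaches the threshold."""
    return [
        i for i, s in enumerate(box_scores)
        if topk_score_sum(s, k) >= threshold
    ]
```

Compared with thresholding the single highest score, this rule keeps boxes that the teacher is confident contain an object but whose probability mass is split between a few similar classes, which is one plausible way the recall of pseudo labels improves.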

PDF Details DOI

IJCAI Conference 2022 Conference Paper

Multi-level Consistency Learning for Semi-supervised Domain Adaptation

Zizheng Yan
Yushuang Wu
Guanbin Li
Yipeng Qin
Xiaoguang Han
Shuguang Cui

Semi-supervised domain adaptation (SSDA) aims to apply knowledge learned from a fully labeled source domain to a scarcely labeled target domain. In this paper, we propose a Multi-level Consistency Learning (MCL) framework for SSDA. Specifically, our MCL regularizes the consistency of different views of target domain samples at three levels: (i) at inter-domain level, we robustly and accurately align the source and target domains using a prototype-based optimal transport method that utilizes the pros and cons of different views of target samples; (ii) at intra-domain level, we facilitate the learning of both discriminative and compact target feature representations by proposing a novel class-wise contrastive clustering loss; (iii) at sample level, we follow standard practice and improve the prediction accuracy by conducting a consistency-based self-training. Empirically, we verified the effectiveness of our MCL framework on three popular SSDA benchmarks, i.e., VisDA2017, DomainNet, and Office-Home datasets, and the experimental results demonstrate that our MCL framework achieves the state-of-the-art performance.

PDF Details DOI

AAAI Conference 2022 Conference Paper

Unsupervised Domain Adaptive Salient Object Detection through Uncertainty-Aware Pseudo-Label Learning

Pengxiang Yan
Ziyi Wu
Mengmeng Liu
Kun Zeng
Liang Lin
Guanbin Li

Recent advances in deep learning significantly boost the performance of salient object detection (SOD) at the expense of labeling larger-scale per-pixel annotations. To relieve the burden of labor-intensive labeling, deep unsupervised SOD methods have been proposed to exploit noisy labels generated by handcrafted saliency methods. However, it is still difficult to learn accurate saliency details from rough noisy labels. In this paper, we propose to learn saliency from synthetic but clean labels, which naturally have higher pixel-labeling quality without the effort of manual annotations. Specifically, we first construct a novel synthetic SOD dataset by a simple copy-paste strategy. Considering the large appearance differences between the synthetic and real-world scenarios, directly training with synthetic data will lead to performance degradation on real-world scenarios. To mitigate this problem, we propose a novel unsupervised domain adaptive SOD method to adapt between these two domains by uncertainty-aware self-training. Experimental results show that our proposed method outperforms the existing state-of-the-art deep unsupervised SOD methods on several benchmark datasets, and is even comparable to fully-supervised ones.

PDF Details

AAAI Conference 2021 Conference Paper

Multi-Layer Networks for Ensemble Precipitation Forecasts Postprocessing

Fengyang Xu
Guanbin Li
Yunfei Du
Zhiguang Chen
Yutong Lu

The postprocessing method of ensemble forecasts is usually used to find a more precise estimate of future precipitation, because dynamic meteorology models have limitations in fitting fine-grained atmospheric processes and precipitation is driven more often by smaller-scale processes, while ensemble forecasts can hit this precipitation at times. However, the pattern of these hits cannot be easily summarized. The existing objective postprocessing methods tend to extend the rain area or false alarm the precipitation intensity categories. In this work, we introduce a multi-layer structure to simultaneously reduce the bias in forecast ensembles output by meteorology models and merge them to a quality deterministic (single-valued) forecast using cross-grid information, which differs quite dramatically from the previous statistical postprocessing method. The multi-layer network is designed to model the spatial distribution of future precipitation of different intensity categories (IC-MLNet). We provide a comparison of IC-MLNet to simple average as well as another two state-of-the-art ensemble quantitative precipitation forecasts (QPFs) postprocessing approaches over both single-model and multi-model ensemble forecasts datasets from TIGGE. The experimental results indicate that our model achieves superior performance over the compared baselines in precipitation amount prediction as well as precipitation intensities categories prediction.

PDF Details

IJCAI Conference 2021 Conference Paper

Weakly-Supervised Spatio-Temporal Anomaly Detection in Surveillance Video

Jie Wu
Wei Zhang
Guanbin Li
Wenhao Wu
Xiao Tan
Yingying Li
Errui Ding
Liang Lin

In this paper, we introduce a novel task, referred to as Weakly-Supervised Spatio-Temporal Anomaly Detection (WSSTAD) in surveillance video. Specifically, given an untrimmed video, WSSTAD aims to localize a spatio-temporal tube (i.e., a sequence of bounding boxes at consecutive times) that encloses the abnormal event, with only coarse video-level annotations as supervision during training. To address this challenging task, we propose a dual-branch network which takes as input the proposals with multi-granularities in both spatial-temporal domains. Each branch employs a relationship reasoning module to capture the correlation between tubes/videolets, which can provide rich contextual information and complex entity relationships for the concept learning of abnormal behaviors. A Mutually-guided Progressive Refinement framework is set up to employ dual-path mutual guidance in a recurrent manner, iteratively sharing auxiliary supervision information across branches. It impels the learned concepts of each branch to serve as a guide for its counterpart, which progressively refines the corresponding branch and the whole framework. Furthermore, we contribute two datasets, i.e., ST-UCF-Crime and STRA, consisting of videos containing spatio-temporal abnormal annotations to serve as the benchmarks for WSSTAD. We conduct extensive qualitative and quantitative evaluations to demonstrate the effectiveness of the proposed approach and analyze the key factors that contribute most to handling this task.
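
The spatio-temporal tube defined in this abstract, a sequence of bounding boxes at consecutive times, maps naturally onto a small data structure. The sketch below is one possible representation and is not taken from the paper: the class name `Tube`, the field names, and the `(x1, y1, x2, y2)` box format are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


@dataclass
class Tube:
    """A spatio-temporal tube: one box per frame over consecutive frames."""
    start_frame: int
    boxes: List[Box]  # boxes[i] is the box at frame start_frame + i

    @property
    def end_frame(self) -> int:
        """Index of the last frame covered by the tube (inclusive)."""
        return self.start_frame + len(self.boxes) - 1

    def box_at(self, frame: int) -> Optional[Box]:
        """Return the box at the given frame, or None outside the tube."""
        offset = frame - self.start_frame
        if 0 <= offset < len(self.boxes):
            return self.boxes[offset]
        return None
```

Storing only a start frame and a list enforces the "consecutive times" property by construction, since there is no way to express a gap between frames.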

PDF Details DOI

AAAI Conference 2020 Conference Paper

An Adversarial Perturbation Oriented Domain Adaptation Approach for Semantic Segmentation

Jihan Yang
Ruijia Xu
Ruiyu Li
Xiaojuan Qi
Xiaoyong Shen
Guanbin Li
Liang Lin

We focus on Unsupervised Domain Adaptation (UDA) for the task of semantic segmentation. Recently, adversarial alignment has been widely adopted to match the marginal distribution of feature representations across two domains globally. However, this strategy fails in adapting the representations of the tail classes or small objects for semantic segmentation since the alignment objective is dominated by head categories or large objects. In contrast to adversarial alignment, we propose to explicitly train a domain-invariant classifier by generating and defending against pointwise feature space adversarial perturbations. Specifically, we firstly perturb the intermediate feature maps with several attack objectives (i.e., discriminator and classifier) on each individual position for both domains, and then the classifier is trained to be invariant to the perturbations. By perturbing each position individually, our model treats each location evenly regardless of the category or object size and thus circumvents the aforementioned issue. Moreover, the domain gap in feature space is reduced by extrapolating source and target perturbed features towards each other with attack on the domain discriminator. Our approach achieves the state-of-the-art performance on two challenging domain adaptation tasks for semantic segmentation: GTA5 → Cityscapes and SYNTHIA → Cityscapes.

PDF Details

AAAI Conference 2020 Conference Paper

Knowledge Graph Transfer Network for Few-Shot Recognition

Riquan Chen
Tianshui Chen
Xiaolu Hui
Hefeng Wu
Guanbin Li
Liang Lin

Few-shot learning aims to learn novel categories from very few samples given some base categories with sufficient training samples. The main challenge of this task is that the novel categories are prone to being dominated by color, texture, shape of the object or background context (namely specificity), which are distinct for the given few training samples but not common for the corresponding categories (see Figure 1). Fortunately, we find that transferring information of the correlated base categories can help learn the novel concepts and thus avoid the novel concept being dominated by the specificity. Besides, incorporating semantic correlations among different categories can effectively regularize this information transfer. In this work, we represent the semantic correlations in the form of structured knowledge graph and integrate this graph into deep neural networks to promote few-shot learning by a novel Knowledge Graph Transfer Network (KGTN). Specifically, by initializing each node with the classifier weight of the corresponding category, a propagation mechanism is learned to adaptively propagate node message through the graph to explore node interaction and transfer classifier information of the base categories to those of the novel ones. Extensive experiments on the ImageNet dataset show significant performance improvement compared with current leading competitors. Furthermore, we construct an ImageNet-6K dataset that covers larger scale categories, i.e., 6,000 categories, and experiments on this dataset further demonstrate the effectiveness of our proposed model.

PDF Details

ECAI Conference 2020 Conference Paper

MetaSelection: Metaheuristic Sub-Structure Selection for Neural Network Pruning Using Evolutionary Algorithm

Zixun Zhang
Zhen Li 0026
Lin Lin 0008
Na Lei
Guanbin Li
Shuguang Cui

Neural network pruning is widely applied to various mobile applications. Previous pruning methods mainly leverage ad-hoc criteria to evaluate channel importance. In this paper, we propose an effective metaheuristic sub-structure selection (MetaSelection) method for neural network pruning. MetaSelection exploits an evolutionary algorithm (EA) to search the proper sub-structure satisfying the resource constraints. In comparison with previous AutoML based methods, MetaSelection can automatically achieve the pruning rate and channel selection at the same time instead of hand-crafted criteria in a cascaded way. Regarding the tremendous search space of channel selection as a combinatorial optimization problem, we further utilize a coarse-to-fine strategy and the novel probability distribution crossover (PDC) to speed up the search procedure. Besides, MetaSelection prunes the network globally rather than in a layer-by-layer way. We evaluate MetaSelection on several appealing deep neural networks, achieving superior results with adaptive depth and width. Concretely, on ImageNet, MetaSelection achieves a top-1 accuracy of 71.5% on MobileNetV2 under 70% FLOPs constraint and a FLOPs reduction of 30% with 76.4% top-1 accuracy for ResNet50.

Details

AAAI Conference 2020 Conference Paper

Tree-Structured Policy Based Progressive Reinforcement Learning for Temporally Language Grounding in Video

Jie Wu
Guanbin Li
Si Liu
Liang Lin

Temporally language grounding in untrimmed videos is a newly-raised task in video understanding. Most of the existing methods suffer from inferior efficiency, lacking interpretability, and deviating from the human perception mechanism. Inspired by human’s coarse-to-fine decision-making paradigm, we formulate a novel Tree-Structured Policy based Progressive Reinforcement Learning (TSP-PRL) framework to sequentially regulate the temporal boundary by an iterative refinement process. The semantic concepts are explicitly represented as the branches in the policy, which contributes to efficiently decomposing complex policies into an interpretable primitive action. Progressive reinforcement learning provides correct credit assignment via two task-oriented rewards that encourage mutual promotion within the tree-structured policy. We extensively evaluate TSP-PRL on the Charades-STA and ActivityNet datasets, and experimental results show that TSP-PRL achieves competitive performance over existing state-of-the-art methods.

PDF Details

AAAI Conference 2019 Conference Paper

FRAME Revisited: An Interpretation View Based on Particle Evolution

Xu Cai
Yang Wu
Guanbin Li
Ziliang Chen
Liang Lin

FRAME (Filters, Random fields, And Maximum Entropy) is an energy-based descriptive model that synthesizes visual realism by capturing mutual patterns from structural input signals. The maximum likelihood estimation (MLE) is applied by default, yet conventionally causes the unstable training energy that wrecks the generated structures, which remains unexplained. In this paper, we provide a new theoretical insight to analyze FRAME, from a perspective of particle physics ascribing the weird phenomenon to KL-vanishing issue. In order to stabilize the energy dissipation, we propose an alternative Wasserstein distance in discrete time based on the conclusion that the Jordan-Kinderlehrer-Otto (JKO) discrete flow approximates KL discrete flow when the time step size tends to 0. Besides, this metric can still maintain the model’s statistical consistency. Quantitative and qualitative experiments have been respectively conducted on several widely used datasets. The empirical studies have evidenced the effectiveness and superiority of our method.

PDF Details

ICRA Conference 2019 Conference Paper

Lightweight Contrast Modeling for Attention-Aware Visual Localization

Lili Huang 0004
Guanbin Li
Ya Li
Liang Lin

Salient object detection, which aims at localizing the attention-aware visual objects, is the indispensable technology for intelligent robots to understand and interact with the complicated environments. Existing salient object detection approaches mainly focus on the optimization of detection performance, while ignoring the considerations for computational resource consumption and algorithm efficiency. Contrarily, we build a superior lightweight network architecture to simultaneously improve performance on both accuracy and efficiency for salient object detection. Specifically, our proposed approach adopts the lightweight bottleneck as its primary building block to significantly reduce the number of parameters and to speed up the process of training and inference. In practice, the visual contrast is insufficiently discovered with the limitation of the small empirical receptive field of CNN. To alleviate this issue, we design a multi-scale convolution module to rapidly discover high-level visual contrast. Moreover, a lightweight refinement module is utilized to restore object saliency details with negligible extra cost. Extensive experiments on efficiency and accuracy trade-offs show that our model is more competitive than the state-of-the-art works on salient object detection task and has prominent potentials for robots applications in real time.

Details

ICML Conference 2019 Conference Paper

Multivariate-Information Adversarial Ensemble for Scalable Joint Distribution Matching

Ziliang Chen 0001
Zhanfu Yang
Xiaoxi Wang
Xiaodan Liang
Xiaopeng Yan
Guanbin Li
Liang Lin

A broad range of cross-$m$-domain generation researches boil down to matching a joint distribution by deep generative models (DGMs). Hitherto algorithms excel in pairwise domains while as $m$ increases, remain struggling to scale themselves to fit a joint distribution. In this paper, we propose a domain-scalable DGM, i.e., MMI-ALI for $m$-domain joint distribution matching. As an $m$-domain ensemble model of ALIs (Dumoulin et al., 2016), MMI-ALI is adversarially trained with maximizing Multivariate Mutual Information (MMI) w.r.t. joint variables of each pair of domains and their shared feature. The negative MMIs are upper bounded by a series of feasible losses provably leading to matching $m$-domain joint distributions. MMI-ALI linearly scales as $m$ increases and thus, strikes a right balance between efficacy and scalability. We evaluate MMI-ALI in diverse challenging $m$-domain scenarios and verify its superiority.

Details

AAAI Conference 2019 Conference Paper

Non-Local Context Encoder: Robust Biomedical Image Segmentation against Adversarial Attacks

Xiang He
Sibei Yang
Guanbin Li
Haofeng Li
Huiyou Chang
Yizhou Yu

Recent progress in biomedical image segmentation based on deep convolutional neural networks (CNNs) has drawn much attention. However, its vulnerability towards adversarial samples cannot be overlooked. This paper is the first one that discovers that all the CNN-based state-of-the-art biomedical image segmentation models are sensitive to adversarial perturbations. This limits the deployment of these methods in safety-critical biomedical fields. In this paper, we discover that global spatial dependencies and global contextual information in a biomedical image can be exploited to defend against adversarial attacks. To this end, non-local context encoder (NLCE) is proposed to model short- and long-range spatial dependencies and encode global contexts for strengthening feature activations by channel-wise attention. The NLCE modules enhance the robustness and accuracy of the non-local context encoding network (NLCEN), which learns robust enhanced pyramid feature representations with NLCE modules, and then integrates the information across different levels. Experiments on both lung and skin lesion segmentation datasets have demonstrated that NLCEN outperforms any other state-of-the-art biomedical image segmentation methods against adversarial attacks. In addition, NLCE modules can be applied to improve the robustness of other CNN-based biomedical image segmentation methods.

PDF Details

AAAI Conference 2019 Conference Paper

Semantic Relationships Guided Representation Learning for Facial Action Unit Recognition

Guanbin Li
Xin Zhu
Yirui Zeng
Qing Wang
Liang Lin

Facial action unit (AU) recognition is a crucial task for facial expressions analysis and has attracted extensive attention in the field of artificial intelligence and computer vision. Existing works have either focused on designing or learning complex regional feature representations, or delved into various types of AU relationship modeling. Albeit with varying degrees of progress, it is still arduous for existing methods to handle complex situations. In this paper, we investigate how to integrate the semantic relationship propagation between AUs in a deep neural network framework to enhance the feature representation of facial regions, and propose an AU semantic relationship embedded representation learning (SRERL) framework. Specifically, by analyzing the symbiosis and mutual exclusion of AUs in various facial expressions, we organize the facial AUs in the form of structured knowledge-graph and integrate a Gated Graph Neural Network (GGNN) in a multi-scale CNN framework to propagate node information through the graph for generating enhanced AU representation. As the learned feature involves both the appearance characteristics and the AU relationship reasoning, the proposed model is more robust and can cope with more challenging cases, e.g., illumination change and partial occlusion. Extensive experiments on the two public benchmarks demonstrate that our method outperforms the previous work and achieves state of the art performance.

PDF Details

IJCAI Conference 2018 Conference Paper

Crowd Counting using Deep Recurrent Spatial-Aware Network

Lingbo Liu
Hongjun Wang
Guanbin Li
Wanli Ouyang
Liang Lin

Crowd counting from unconstrained scene images is a crucial task in many real-world applications like urban surveillance and management, but it is greatly challenged by the camera’s perspective that causes huge appearance variations in people’s scales and rotations. Conventional methods address such challenges by resorting to fixed multi-scale architectures that are often unable to cover the largely varied scales while ignoring the rotation variations. In this paper, we propose a unified neural network framework, named Deep Recurrent Spatial-Aware Network, which adaptively addresses the two issues in a learnable spatial transform module with a region-wise refinement process. Specifically, our framework incorporates a Recurrent Spatial-Aware Refinement (RSAR) module iteratively conducting two components: i) a Spatial Transformer Network that dynamically locates an attentional region from the crowd density map and transforms it to the suitable scale and rotation for optimal crowd estimation; ii) a Local Refinement Network that refines the density map of the attended region with residual learning. Extensive experiments on four challenging benchmarks show the effectiveness of our approach. Specifically, comparing with the existing best-performing methods, we achieve an improvement of 12% on the largest dataset WorldExpo’10 and 22.8% on the most challenging dataset UCF_CC_50.

PDF Details

AAAI Conference 2018 Conference Paper

Recurrent Attentional Reinforcement Learning for Multi-Label Image Recognition

Tianshui Chen
Zhouxia Wang
Guanbin Li
Liang Lin

Recognizing multiple labels of images is a fundamental but challenging task in computer vision, and remarkable progress has been attained by localizing semantic-aware image regions and predicting their labels with deep convolutional neural networks. The step of hypothesis regions (region proposals) localization in these existing multi-label image recognition pipelines, however, usually takes redundant computation cost, e.g., generating hundreds of meaningless proposals with non-discriminative information and extracting their features, and the spatial contextual dependency modeling among the localized regions is often ignored or over-simplified. To resolve these issues, this paper proposes a recurrent attention reinforcement learning framework to iteratively discover a sequence of attentional and informative regions that are related to different semantic objects and further predict label scores conditioned on these regions. Besides, our method explicitly models long-term dependencies among these attentional regions that help to capture semantic label co-occurrence and thus facilitate multi-label recognition. Extensive experiments and comparisons on two large-scale benchmarks (i.e., PASCAL VOC and MS-COCO) show that our model achieves superior performance over existing state-of-the-art methods in both performance and efficiency as well as explicitly identifying image-level semantic labels to specific object regions.

PDF Details

AAAI Conference 2018 Conference Paper

Weakly Supervised Salient Object Detection Using Image Labels

Guanbin Li
Yuan Xie
Liang Lin

Deep learning based salient object detection has recently achieved great success, with its performance greatly outperforming that of unsupervised methods. However, annotating per-pixel saliency masks is a tedious and inefficient procedure. In this paper, we note that superior salient object detection can be obtained by iteratively mining and correcting the labeling ambiguity on saliency maps from traditional unsupervised methods. We propose to use the combination of a coarse salient object activation map from the classification network and saliency maps generated from unsupervised methods as pixel-level annotation, and develop a simple yet very effective algorithm to train fully convolutional networks for salient object detection supervised by these noisy annotations. Our algorithm is based on alternately exploiting a graphical model and training a fully convolutional network for model updating. The graphical model corrects the internal labeling ambiguity through spatial consistency and structure preserving while the fully convolutional network helps to correct the cross-image semantic ambiguity and simultaneously update the coarse activation map for next iteration. Experimental results demonstrate that our proposed method greatly outperforms all state-of-the-art unsupervised saliency detection methods and can be comparable to the current best strongly-supervised methods training with thousands of pixel-level saliency map annotations on all public benchmarks.

PDF Details